Texts in Applied Mathematics 63 


T.J. Sullivan


Introduction 
to Uncertainty 
Quantification 


Springer 



Texts in Applied Mathematics 

Volume 63 


Editors-in-chief: 

Stuart Antman, University of Maryland, College Park, USA 
Leslie Greengard, New York University, New York City, USA 
Philip Holmes, Princeton University, Princeton, USA 


Series Editors: 

John B. Bell, Lawrence Berkeley National Lab, Berkeley, USA 

Joseph B. Keller, Stanford University, Stanford, USA 

Robert Kohn, New York University, New York City, USA 

Paul Newton, University of Southern California, Los Angeles, USA 

Charles Peskin, New York University, New York City, USA 

Robert Pego, Carnegie Mellon University, Pittsburgh, USA

Lenya Ryzhik, Stanford University, Stanford, USA 

Amit Singer, Princeton University, Princeton, USA 

Angela Stevens, Universität Münster, Münster, Germany

Andrew Stuart, University of Warwick, Coventry, UK 

Thomas Witelski, Duke University, Durham, USA 

Stephen Wright, University of Wisconsin-Madison, Madison, USA 


More information about this series at http://www.springer.com/series/1214 





T.J. Sullivan 
Mathematics Institute 
University of Warwick 
Coventry, UK 


ISSN 0939-2475 ISSN 2196-9949 (electronic) 

Texts in Applied Mathematics 

ISBN 978-3-319-23394-9 ISBN 978-3-319-23395-6 (eBook) 

DOI 10.1007/978-3-319-23395-6 

Library of Congress Control Number: 2015958897 

Mathematics Subject Classification: 65-01, 62-01, 41-01, 42-01, 60G60, 65Cxx, 65J22 

Springer Cham Heidelberg New York Dordrecht London 
(c) Springer International Publishing Switzerland 2015 

This work is subject to copyright. All rights are reserved by the Publisher, whether the 
whole or part of the material is concerned, specifically the rights of translation, reprint- 
ing, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any 
other physical way, and transmission or information storage and retrieval, electronic adap- 
tation, computer software, or by similar or dissimilar methodology now known or hereafter 
developed. 

The use of general descriptive names, registered names, trademarks, service marks, etc. 
in this publication does not imply, even in the absence of a specific statement, that such 
names are exempt from the relevant protective laws and regulations and therefore free for 
general use. 

The publisher, the authors and the editors are safe to assume that the advice and informa- 
tion in this book are believed to be true and accurate at the date of publication. Neither 
the publisher nor the authors or the editors give a warranty, express or implied, with re- 
spect to the material contained herein or for any errors or omissions that may have been 
made. 

Printed on acid-free paper 

Springer International Publishing AG Switzerland is part of Springer Science+Business 
Media (www.springer.com) 


For N.T.K. 



Preface 


This book is designed as a broad introduction to the mathematics of Un- 
certainty Quantification (UQ) at the fourth year (senior) undergraduate or 
beginning postgraduate level. It is aimed primarily at readers from a math- 
ematical or statistical (rather than, say, engineering) background. The main 
mathematical prerequisite is familiarity with the language of linear functional 
analysis and measure / probability theory, and some familiarity with basic 
optimization theory. Chapters 2-5 of the text provide a review of this mate- 
rial, generally without detailed proof. 

The aim of this book has been to give a survey of the main objectives in 
the field of UQ and a few of the mathematical methods by which they can 
be achieved. However, this book is no exception to the old saying that books 
are never completed, only abandoned. There are many more UQ problems 
and solution methods in the world than those covered here. For any grievous 
omissions, I ask for your indulgence, and would be happy to receive sugges- 
tions for improvements. With the exception of the preliminary material on 
measure theory and functional analysis, this book should serve as a basis 
for a course comprising 30-45 hours’ worth of lectures, depending upon the 
instructor’s choices in terms of selection of topics and depth of treatment. 

The examples and exercises in this book aim to be simple but informative 
about individual components of UQ studies: practical applications almost 
always require some ad hoc combination of multiple techniques (e.g., Gaus- 
sian process regression plus quadrature plus reduced-order modelling). Such 
compound examples have been omitted in the interests of keeping the pre- 
sentation of the mathematical ideas clean, and in order to focus on examples 
and exercises that will be more useful to instructors and students. 

Each chapter concludes with a bibliography, the aim of which is threefold: 
to give sources for results discussed but not proved in the text; to give some 
historical overview and context; and, most importantly, to give students a 
jumping-off point for further reading and research. This has led to a large 
bibliography, but hopefully a more useful text for budding researchers.


Chapter 1

Introduction 


We must think differently about our ideas — 
and how we test them. We must become more 
comfortable with probability and uncertainty. 
We must think more carefully about the as- 
sumptions and beliefs that we bring to a 
problem. 


The Signal and the Noise: The Art of 
Science and Prediction 
Nate Silver 


1.1 What is Uncertainty Quantification? 

This book is an introduction to the mathematics of Uncertainty Quantification
(UQ), but what is UQ? It is, roughly put, the coming together of
probability theory and statistical practice with ‘the real world’. These two
anecdotes illustrate something of what is meant by this statement:

• Until the early-to-mid 1990s, risk modelling for catastrophe insurance 
and re-insurance (i.e. insurance for property owners against risks aris- 
ing from earthquakes, hurricanes, terrorism, etc., and then insurance for 
the providers of such insurance) was done on a purely statistical basis. 
Since that time, catastrophe modellers have tried to incorporate models 
for the underlying physics or human behaviour, hoping to gain a more 
accurate predictive understanding of risks by blending the statistics and 
the physics, e.g. by focussing on what is both statistically and physically 
reasonable. This approach also allows risk modellers to study interesting 
hypothetical scenarios in a meaningful way, e.g. using a physics-based 
model of water drainage to assess potential damage from rainfall 10% in 
excess of the historical maximum. 


• Over roughly the same period of time, deterministic engineering models
of complex physical processes began to incorporate some element of
uncertainty to account for lack of knowledge about important physical
parameters, random variability in operating circumstances, or outright
ignorance about what the form of a ‘correct’ model would be. Again the
aim is to provide more accurate predictions about systems’ behaviour.

Thus, a ‘typical’ UQ problem involves one or more mathematical models for
a process of interest, subject to some uncertainty about the correct form
of, or parameter values for, those models. Often, though not always, these
uncertainties are treated probabilistically.

Perhaps as a result of its history, there are many perspectives on what 
UQ is, including at the extremes assertions like “UQ is just a buzzword for 
statistics” or “UQ is just error analysis”. These points of view are somewhat 
extremist, but they do contain a kernel of truth: very often, the probabilistic 
theory underlying UQ methods is actually quite simple, but is obscured by 
the details of the application. However, the complications that practical app- 
lications present are also part of the essence of UQ: it is all very well giving 
an accurate prediction for some insurance risk in terms of an elementary 
mathematical object such as an expected value, but how will you actually go 
about evaluating that expected value when it is an integral over a million- 
dimensional parameter space? Thus, it is important to appreciate both the 
underlying mathematics and the practicalities of implementation, and the 
presentation here leans towards the former while keeping the latter in mind. 

Typical UQ problems of interest include certification, prediction, model 
and software verification and validation, parameter estimation, data assimi- 
lation, and inverse problems. At its very broadest, 

“UQ studies all sources of error and uncertainty, including the following: system- 
atic and stochastic measurement error; ignorance; limitations of theoretical models; 
limitations of numerical representations of those models; limitations of the accuracy 
and reliability of computations, approximations, and algorithms; and human error. 

A more precise definition is UQ is the end-to-end study of the reliability of scientific 
inferences.” (U.S. Department of Energy, 2009, p. 135) 

It is especially important to appreciate the “end-to-end” nature of UQ 
studies: one is interested in relationships between pieces of information, not
the ‘truth’ of those pieces of information/assumptions, bearing in mind that 
they are only approximations of reality. There is always going to be a risk of 
‘Garbage In, Garbage Out’. UQ cannot tell you that your model is ‘right’ or 
‘true’, but only that, if you accept the validity of the model (to some quanti- 
fied degree), then you must logically accept the validity of certain conclusions 
(to some quantified degree). In the author’s view, this is the proper interpre- 
tation of philosophically sound but somewhat unhelpful assertions like “Veri- 
fication and validation of numerical models of natural systems is impossible” 
and “The primary value of models is heuristic” (Oreskes et al., 1994). UQ
can, however, tell you that two or more of your modelling assumptions are
mutually contradictory, and hence that your model is wrong, and a complete
UQ analysis will include a meta-analysis examining the sensitivity of the
original analysis to perturbations of the governing assumptions.

A prototypical, if rather over-used, example for UQ is an elliptic PDE with
uncertain coefficients:

Example 1.1. Consider the following elliptic boundary value problem on a
connected Lipschitz domain $\mathcal{X} \subseteq \mathbb{R}^n$ (typically $n = 2$ or $3$):

$$-\nabla \cdot (\kappa \nabla u) = f \quad \text{in } \mathcal{X}, \qquad (1.1)$$

$$u = b \quad \text{on } \partial\mathcal{X}.$$

Problem (1.1) is a simple but not overly naïve model for the pressure field $u$
of some fluid occupying a domain $\mathcal{X}$. The domain $\mathcal{X}$ consists of a material,
and the tensor field $\kappa\colon \mathcal{X} \to \mathbb{R}^{n \times n}$ describes the permeability of this material
to the fluid. There is a source term $f\colon \mathcal{X} \to \mathbb{R}$, and the boundary condition
specifies the values $b\colon \partial\mathcal{X} \to \mathbb{R}$ that the pressure takes on the boundary of $\mathcal{X}$.
This model is of interest in the earth sciences because Darcy’s law asserts that
the velocity field $v$ of the fluid flow in this medium is related to the gradient
of the pressure field by

$$v = -\kappa \nabla u.$$

If the fluid contains some kind of contaminant, then it may be important to 
understand where fluid following the velocity field v will end up, and when. 

In a course on PDE theory, you will learn that, for each given
positive-definite and essentially bounded permeability field $\kappa$, problem (1.1) has a
unique weak solution $u$ in the Sobolev space $H_0^1(\mathcal{X})$ for each forcing term $f$
in the dual Sobolev space $H^{-1}(\mathcal{X})$. This is known as the forward problem.
One objective of this book is to tell you that this is far from the end of
the story! As far as practical applications go, existence and uniqueness of
solutions to the forward problem is only the beginning. For one thing, this
PDE model is only an approximation of reality. Secondly, even if the PDE
were a perfectly accurate model, the ‘true’ $\kappa$, $f$ and $b$ are not known precisely,
so our knowledge about $u = u(\kappa, f, b)$ is also uncertain in some way. If $\kappa$, $f$
and $b$ are treated as random variables, then $u$ is also a random variable,
and one is naturally interested in properties of that random variable such
as mean, variance, deviation probabilities, etc. This is known as the forward
propagation of uncertainty, and to perform it we must build some theory for
probability on function spaces.
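
To make the idea of forward propagation concrete, here is a minimal sketch (not taken from the text) for a one-dimensional analogue of problem (1.1): $-(\kappa u')' = f$ on $(0, 1)$ with $u(0) = u(1) = 0$. The i.i.d. lognormal model for the cell permeabilities, the grid size, the threshold, and the names `solve_darcy_1d` and `propagate` are all illustrative assumptions; the solver is a standard finite-difference scheme with a tridiagonal (Thomas) solve, and plain Monte Carlo sampling stands in for the more refined methods treated in later chapters.

```python
import random
import statistics


def solve_darcy_1d(kappa, f=1.0):
    """Solve -(kappa u')' = f on (0, 1) with u(0) = u(1) = 0.

    kappa holds one positive permeability per grid cell (piecewise constant
    on m >= 2 equal cells). Returns the m + 1 nodal values of u.
    """
    m = len(kappa)
    h = 1.0 / m
    n = m - 1  # number of interior nodes
    # Flux balance at interior node i (unknown index k = i - 1):
    # (kappa[k] + kappa[k+1]) u_i - kappa[k] u_{i-1} - kappa[k+1] u_{i+1} = f h^2
    diag = [kappa[k] + kappa[k + 1] for k in range(n)]
    sub = [-kappa[k] for k in range(n)]      # coupling to the left neighbour
    sup = [-kappa[k + 1] for k in range(n)]  # coupling to the right neighbour
    rhs = [f * h * h] * n
    # Thomas algorithm: forward elimination, then back substitution.
    c = [0.0] * n
    d = [0.0] * n
    c[0] = sup[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for k in range(1, n):
        denom = diag[k] - sub[k] * c[k - 1]
        c[k] = sup[k] / denom
        d[k] = (rhs[k] - sub[k] * d[k - 1]) / denom
    u = [0.0] * n
    u[-1] = d[-1]
    for k in range(n - 2, -1, -1):
        u[k] = d[k] - c[k] * u[k + 1]
    return [0.0] + u + [0.0]


def propagate(n_samples=2000, m=32, sigma=0.5, threshold=0.15, seed=0):
    """Monte Carlo estimates of the mean, the variance and a deviation
    probability of the midpoint pressure u(1/2), assuming i.i.d. lognormal
    cell permeabilities (an illustrative choice of input distribution)."""
    rng = random.Random(seed)
    mid = m // 2
    values = []
    for _ in range(n_samples):
        kappa = [rng.lognormvariate(0.0, sigma) for _ in range(m)]
        values.append(solve_darcy_1d(kappa)[mid])
    mean = statistics.mean(values)
    var = statistics.variance(values)
    p_exceed = sum(v > threshold for v in values) / n_samples
    return mean, var, p_exceed
```

For constant $\kappa$ and $f$ the scheme reproduces the exact solution $u(x) = f x (1 - x) / (2 \kappa)$ at the grid nodes, which gives a simple sanity check before any randomness is introduced.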

Another issue is that often we want to solve an inverse problem: perhaps
we know something about $f$, $b$ and $u$ and want to infer $\kappa$ via the relationship
(1.1). For example, we may observe the pressure $u(x_i)$ at finitely many points
$x_i \in \mathcal{X}$. This problem is hugely underdetermined, and hence ill-posed;
ill-posedness is characteristic of many inverse problems, and is only worsened
by the fact that the observations may be corrupted by observational noise.
Even a prototypical inverse problem such as this one is of enormous practical
interest: it is by solving such inverse problems that oil companies attempt to
infer the location of oil deposits in order to make a profit, and seismologists
the structure of the planet in order to make earthquake predictions. Both
of these problems, the forward and inverse propagation of uncertainty, fall
under the very general remit of UQ. Furthermore, in practice, the domain
$\mathcal{X}$ and the fields $f$, $b$, $\kappa$ and $u$ are all discretized and solved for numerically
(i.e. approximately and finite-dimensionally), so it is of interest to understand
the impact of these discretization errors.

Epistemic and Aleatoric Uncertainty. It is common to divide uncer- 
tainty into two types, aleatoric and epistemic uncertainty. Aleatoric uncer- 
tainty — from the Latin alea, meaning a die — refers to uncertainty about 
an inherently variable phenomenon. Epistemic uncertainty (from the Greek
ἐπιστήμη, meaning knowledge) refers to uncertainty arising from lack of
knowledge. If one has at hand a model for some system of interest, then epis- 
temic uncertainty is often further subdivided into model form uncertainty, in 
which one has significant doubts that the model is even ‘structurally correct’, 
and parametric uncertainty, in which one believes that the form of the model 
reflects reality well, but one is uncertain about the correct values to use for 
particular parameters in the model. 

To a certain extent, the distinction between epistemic and aleatoric un- 
certainty is an imprecise one, and repeats the old debate between frequentist 
and subjectivist (e.g. Bayesian) statisticians. Someone who was simultane- 
ously a devout Newtonian physicist and a devout Bayesian might argue that 
the results of dice rolls are not aleatoric uncertainties — one simply doesn’t 
have complete enough information about the initial conditions of the die, the
material and geometry of the die, any gusts of wind that might affect the 
flight of the die, and so forth. On the other hand, it is usually clear that 
some forms of uncertainty are epistemic rather than aleatoric: for example, 
when physicists say that they have yet to come up with a Theory of Every- 
thing, they are expressing a lack of knowledge about the laws of physics in 
our universe, and the correct mathematical description of those laws. In any 
case, regardless of one’s favoured interpretation of probability, the language 
of probability theory is a powerful tool in describing uncertainty. 

Some Typical UQ Objectives. Many common UQ objectives can be illustrated
in the context of a system, $F$, that maps inputs $X$ in some space $\mathcal{X}$ to
outputs $Y = F(X)$ in some space $\mathcal{Y}$. Some common UQ objectives include:

• The forward propagation or push-forward problem. Suppose that the
uncertainty about the inputs of $F$ can be summarized in a probability
distribution $\mu$ on $\mathcal{X}$. Given this, determine the induced probability distribution
$F_*\mu$ on the output space $\mathcal{Y}$, as defined by

$$(F_*\mu)(E) := \mathbb{P}_\mu(\{x \in \mathcal{X} \mid F(x) \in E\}) = \mathbb{P}_\mu[F(X) \in E].$$

This task is typically complicated by $\mu$ being a complicated distribution,
or $F$ being non-linear. Because $F_*\mu$ is a very high-dimensional object,
it is often more practical to identify some specific outcomes of interest
and settle for a solution of the following problem:

• The reliability or certification problem. Suppose that some set $\mathcal{Y}_{\mathrm{fail}} \subseteq \mathcal{Y}$
is identified as a ‘failure set’, i.e. the outcome $F(X) \in \mathcal{Y}_{\mathrm{fail}}$ is undesirable
in some way. Given appropriate information about the inputs $X$ and
forward process $F$, determine the failure probability,

$$\mathbb{P}_\mu[F(X) \in \mathcal{Y}_{\mathrm{fail}}].$$

Furthermore, in the case of a failure, how large will the deviation from
acceptable performance be, and what are the consequences?

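• The prediction problem. Dually to the reliability problem, given a maximum
acceptable probability of error $\varepsilon > 0$, find a set $\mathcal{Y}_\varepsilon \subseteq \mathcal{Y}$ such that

$$\mathbb{P}_\mu[F(X) \in \mathcal{Y}_\varepsilon] \geq 1 - \varepsilon,$$

i.e. the prediction $F(X) \in \mathcal{Y}_\varepsilon$ is wrong with probability at most $\varepsilon$.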

• An inverse problem, such as state estimation (often for a quantity that
is changing in time) or parameter identification (usually for a quantity
that is not changing, or is a non-physical model parameter). Given some
observations of the output, $Y$, which may be corrupted or unreliable in
some way, attempt to determine the corresponding inputs $X$ such that
$F(X) = Y$. In what sense are some estimates for $X$ more or less reliable
than others?

• The model reduction or model calibration problem. Construct another
function $F_h$ (perhaps a numerical model with certain numerical parameters
to be calibrated, or one involving far fewer input or output variables)
such that $F_h \approx F$ in an appropriate sense. Quantifying the accuracy of
the approximation may itself be a certification or prediction problem.
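
The first three objectives in this list can be sketched with plain Monte Carlo sampling. The example below is only an illustration under stated assumptions: the toy system $F(x) = x^2$ with $X \sim N(0, 1)$, the failure set $\{y > 1.96^2\}$, and the helper names `push_forward_samples`, `failure_probability` and `prediction_threshold` are hypothetical choices, not taken from the text. The prediction set it produces has the required coverage on the samples only, which is not the same as a guarantee under the true distribution.

```python
import math
import random


def push_forward_samples(F, sample_input, n, rng):
    """Draw n samples of Y = F(X), i.e. samples from the push-forward F_* mu."""
    return [F(sample_input(rng)) for _ in range(n)]


def failure_probability(outputs, is_failure):
    """Monte Carlo estimate of P[F(X) in Y_fail]."""
    return sum(1 for y in outputs if is_failure(y)) / len(outputs)


def prediction_threshold(outputs, eps):
    """Return a sample value q such that at least a fraction 1 - eps of the
    samples satisfy y <= q; the set (-inf, q] is then an empirical Y_eps."""
    ordered = sorted(outputs)
    k = math.ceil((1.0 - eps) * len(ordered))  # number of samples to cover
    return ordered[max(k, 1) - 1]


# Hypothetical toy system: X ~ N(0, 1) and F(x) = x^2.
rng = random.Random(42)
outputs = push_forward_samples(
    lambda x: x * x, lambda r: r.gauss(0.0, 1.0), 100_000, rng
)
# Certification: the exact failure probability P[|X| > 1.96] is about 0.05.
p_fail = failure_probability(outputs, lambda y: y > 1.96 ** 2)
# Prediction: an upper bound q that the samples exceed at most 5% of the time.
q = prediction_threshold(outputs, eps=0.05)
```

In this toy case the two problems are visibly dual: the threshold `q` found by the prediction step should be close to the boundary $1.96^2$ of the failure set whose probability is about $0.05$.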

Sometimes a UQ problem consists of several of these problems coupled 
together: for example, one might have to solve an inverse problem to produce 
or improve some model parameters, and then use those parameters to propa- 
gate some other uncertainties forwards, and hence produce a prediction that 
can be used for decision support in some certification problem. 

Typical issues to be confronted in addressing these problems include the 
high dimension of the parameter spaces associated with practical problems; 
the approximation of integrals (expected values) by numerical quadrature; 
the cost of evaluating functions that often correspond to expensive computer 
simulations or physical experiments; and non-negligible epistemic uncertainty 
about the correct form of vital ingredients in the analysis, such as the func- 
tions and probability measures in key integrals. 

The aim of this book is to provide an introduction to the fundamen- 
tal mathematical ideas underlying the basic approaches to these types of 
problems. Practical UQ applications almost always require some ad hoc
combination of multiple techniques, adapted and specialized to suit the
circumstances, but the emphasis here is on basic ideas, with simple illustrative
examples. The hope is that interested students or practitioners will be able 
to generalize from the topics covered here to their particular problems of int- 
erest, with the help of additional resources cited in the bibliographic discus- 
sions at the end of each chapter. So, for example, while Chapter 12 discusses 
intrusive (Galerkin) methods for UQ with an implicit assumption that the 
basis is a polynomial chaos basis, one should be able to adapt these ideas to 
non-polynomial bases. 

A Word of Warning. UQ is not a mature field like linear algebra or
single-variable complex analysis, with stately textbooks containing well-polished
presentations of classical theorems bearing august names like Cauchy, Gauss
and Hamilton. Both because of its youth as a field and its very close
engagement with applications, UQ is much more about problems, methods and
‘good enough for the job’. There are some very elegant approaches within
UQ, but as yet no single, general, over-arching theory of UQ.


1.2 Mathematical Prerequisites 

Like any course or text, this book has some prerequisites. The perspective on 
UQ that runs through this book is strongly (but not exclusively) grounded 
in probability theory and Hilbert spaces, so the main prerequisite is familiar- 
ity with the language of linear functional analysis and measure/probability 
theory. As a crude diagnostic test, read the following sentence: 

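Given any $\sigma$-finite measure space $(\mathcal{X}, \mathscr{F}, \mu)$, the set of all $\mathscr{F}$-measurable functions
$f\colon \mathcal{X} \to \mathbb{C}$ for which $\int_{\mathcal{X}} |f|^2 \, \mathrm{d}\mu$ is finite, modulo equality $\mu$-almost everywhere, is a
Hilbert space with respect to the inner product $\langle f, g \rangle := \int_{\mathcal{X}} \bar{f} g \, \mathrm{d}\mu$.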

None of the symbols, concepts or terms used or implicit in that sentence 
should give prospective students or readers any serious problems. Chapters 2 
and 3 give a recap, without proof, of the necessary concepts and results, and 
most of the material therein should be familiar territory. In addition, Chap- 
ters 4 and 5 provide additional mathematical background on optimization 
and information theory respectively. It is assumed that readers have greater 
prior familiarity with the material in Chapters 2 and 3 than the material in 
Chapters 4 and 5; this is reflected in the way that results are presented mostly 
without proof in Chapters 2 and 3, but with proof in Chapters 4 and 5. 

If, in addition, students or readers have some familiarity with topics such as 
numerical analysis, ordinary and partial differential equations, and stochas- 
tic analysis, then certain techniques, examples and remarks will make more 
sense. None of these are essential prerequisites, but some ability and
willingness to implement UQ methods (even in simple settings) in, e.g., C/C++,
Mathematica, Matlab, or Python is highly desirable. (Some of the concepts
covered in the book will be given example numerical implementations in
Python.) Although the aim of this book is to give an overview of the
mathematical elements of UQ, this is a topic best learned in the doing, not through
pure theory. However, in the interests of accessibility and pedagogy, none
of the examples or exercises in this book will involve serious programming
legerdemain.

Fig. 1.1: Outline of the book (Leitfaden). An arrow from m to n indicates
that Chapter n substantially depends upon material in Chapter m.


1.3 Outline of the Book 

The first part of this book lays out basic and general mathematical tools 
for the later discussion of UQ. Chapter 2 covers measure and probability 
theory, which are essential tools given the probabilistic description of many 
UQ problems. Chapter 3 covers some elements of linear functional analysis 
on Banach and Hilbert spaces, and constructions such as tensor products, all 
of which are natural spaces for the representation of random quantities and 
fields. Many UQ problems involve a notion of ‘best fit’, and so Chapter 4
provides a brief introduction to optimization theory in general, with particular
attention to linear programming and least squares. Finally, although much of
the UQ theory in this book is probabilistic, and is furthermore an $L^2$ theory,
Chapter 5 covers more general notions of information and uncertainty.

The second part of the book is concerned with mathematical tools that
are much closer to the practice of UQ. We begin in Chapter 6 with a
mathematical treatment of inverse problems, and specifically their Bayesian
interpretation; we take advantage of the tools developed in Chapters 2 and 3 to
discuss Bayesian inverse problems on function spaces, which are especially 
important in PDE applications. In Chapter 7, this leads to a specific class of 
inverse problems, filtering and data assimilation problems, in which data and 
unknowns are decomposed in a sequential manner. Chapter 8 introduces or- 
thogonal polynomial theory, a classical area of mathematics that has a double 
application in UQ: orthogonal polynomials are useful basis functions for the 
representation of random processes, and form the basis of powerful numer- 
ical integration (quadrature) algorithms. Chapter 9 discusses these quadra- 
ture methods in more detail, along with other methods such as Monte Carlo. 
Chapter 10 covers one aspect of forward uncertainty propagation, namely 
sensitivity analysis and model reduction, i.e. finding out which input par- 
ameters are influential in determining the values of some output process. 
Chapter 11 introduces spectral decompositions of random variables and other 
random quantities, including but not limited to polynomial chaos methods. 
Chapter 12 covers the intrusive (or Galerkin) approach to the determination 
of coefficients in spectral expansions; Chapter 13 covers the alternative non- 
intrusive (sample-based) paradigm. Finally, Chapter 14 discusses approaches 
to probability-based UQ that apply when even the probability distributions 
of interest are uncertain in some way. 

The conceptual relationships among the chapters are summarized in Figure 1.1.


1.4 The Road Not Taken 

There are many topics relevant to UQ that are either not covered or discussed 
only briefly here, including: detailed treatment of data assimilation beyond 
the confines of the Kalman filter and its variations; accuracy, stability and 
computational cost of numerical methods; details of numerical implementa- 
tion of optimization methods; stochastic homogenization and other multiscale 
methods; optimal control and robust optimization; machine learning; issues 
related to ‘big data’; and the visualization of uncertainty. 


Chapter 2 

Measure and Probability Theory 


To be conscious that you are ignorant is a 
great step to knowledge. 


Sybil 

Benjamin Disraeli 


Probability theory, grounded in Kolmogorov’s axioms and the general 
foundations of measure theory, is an essential tool in the quantitative mathe- 
matical treatment of uncertainty. Of course, probability is not the only frame- 
work for the discussion of uncertainty: there is also the paradigm of interval 
analysis, and intermediate paradigms such as Dempster-Shafer theory, as 
discussed in Section 2.8 and Chapter 5. 

This chapter serves as a review, without detailed proof, of concepts from 
measure and probability theory that will be used in the rest of the text. 
Like Chapter 3, this chapter is intended as a review of material that should 
be understood as a prerequisite before proceeding; to an extent, Chapters 2 
and 3 are interdependent and so can (and should) be read in parallel with 
one another. 


2.1 Measure and Probability Spaces 

The basic objects of measure and probability theory are sample spaces, which 
are abstract sets; we distinguish certain subsets of these sample spaces as 
being ‘measurable’, and assign to each of them a numerical notion of ‘size’. 
In probability theory, this size will always be a real number between 0 and 1, 
but more general values are possible, and indeed useful. 




Definition 2.1. A measurable space is a pair $(\mathcal{X}, \mathscr{F})$, where

(a) $\mathcal{X}$ is a set, called the sample space; and

(b) $\mathscr{F}$ is a $\sigma$-algebra on $\mathcal{X}$, i.e. a collection of subsets of $\mathcal{X}$ containing $\varnothing$
and closed under countable applications of the operations of union,
intersection and complementation relative to $\mathcal{X}$; elements of $\mathscr{F}$ are called
measurable sets or events.

Example 2.2. (a) On any set $\mathcal{X}$, there is a trivial $\sigma$-algebra in which the
only measurable sets are the empty set $\varnothing$ and the whole space $\mathcal{X}$.

(b) On any set $\mathcal{X}$, there is also the power set $\sigma$-algebra in which every subset
of $\mathcal{X}$ is measurable. It is a fact of life that this $\sigma$-algebra contains too
many measurable sets to be useful for most applications in analysis and
probability.

(c) When $\mathcal{X}$ is a topological (or, better yet, metric or normed) space,
it is common to take $\mathscr{F}$ to be the Borel $\sigma$-algebra $\mathscr{B}(\mathcal{X})$, the smallest
$\sigma$-algebra on $\mathcal{X}$ so that every open set (and hence also every closed set)
is measurable.
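
On a finite set, the notion of the smallest $\sigma$-algebra containing a given family of sets can be computed directly, since countable unions reduce to finite ones. The following minimal sketch is an illustration rather than anything from the text; the function name `generate_sigma_algebra` and the representation of events as `frozenset` objects are assumptions made for the example. It closes a family of generating sets under complements and pairwise unions; closure under intersections then follows from De Morgan's laws.

```python
from itertools import combinations


def generate_sigma_algebra(space, generators):
    """Smallest sigma-algebra on the finite set `space` that contains
    every set in `generators` (each event is stored as a frozenset)."""
    space = frozenset(space)
    sets = {frozenset(), space} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        # Close under complementation relative to the whole space.
        for a in current:
            complement = space - a
            if complement not in sets:
                sets.add(complement)
                changed = True
        # Close under pairwise (hence all finite) unions.
        for a, b in combinations(current, 2):
            union = a | b
            if union not in sets:
                sets.add(union)
                changed = True
    return sets


X = {1, 2, 3, 4}
trivial = generate_sigma_algebra(X, [])                 # only the empty set and X
coarse = generate_sigma_algebra(X, [{1}])               # adds {1} and {2, 3, 4}
power = generate_sigma_algebra(X, [{1}, {2}, {3}])      # every subset of X
```

With no generators the result is the trivial $\sigma$-algebra of part (a), and with enough singletons as generators it is the power set $\sigma$-algebra of part (b).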

Definition 2.3. (a) A signed measure (or charge) on a measurable space
$(\mathcal{X}, \mathscr{F})$ is a function $\mu\colon \mathscr{F} \to \mathbb{R} \cup \{\pm\infty\}$ that takes at most one of the two
infinite values, has $\mu(\varnothing) = 0$, and, whenever $E_1, E_2, \ldots \in \mathscr{F}$ are pairwise
disjoint with union $E \in \mathscr{F}$, then $\mu(E) = \sum_{n \in \mathbb{N}} \mu(E_n)$. In the case that
$\mu(E)$ is finite, we require that the series $\sum_{n \in \mathbb{N}} \mu(E_n)$ converges absolutely
to $\mu(E)$.

(b) A measure is a signed measure that does not take negative values.

(c) A probability measure is a measure such that $\mu(\mathcal{X}) = 1$.

The triple $(\mathcal{X}, \mathscr{F}, \mu)$ is called a signed measure space, measure space, or
probability space as appropriate. The sets of all signed measures, measures,
and probability measures on $(\mathcal{X}, \mathscr{F})$ are denoted $\mathcal{M}_\pm(\mathcal{X}, \mathscr{F})$, $\mathcal{M}_+(\mathcal{X}, \mathscr{F})$,
and $\mathcal{M}_1(\mathcal{X}, \mathscr{F})$ respectively.

Example 2.4. (a) The trivial measure can be defined on any set $\mathcal{X}$ and
$\sigma$-algebra: $\tau(E) := 0$ for every $E \in \mathscr{F}$.

(b) The unit Dirac measure at $a \in \mathcal{X}$ can also be defined on any set $\mathcal{X}$ and
$\sigma$-algebra:

$$\delta_a(E) := \begin{cases} 1, & \text{if } a \in E,\ E \in \mathscr{F}, \\ 0, & \text{if } a \notin E,\ E \in \mathscr{F}. \end{cases}$$

(c) Similarly, we can define counting measure:

$$E \mapsto \begin{cases} n, & \text{if } E \in \mathscr{F} \text{ is a finite set with exactly } n \text{ elements}, \\ +\infty, & \text{if } E \in \mathscr{F} \text{ is an infinite set}. \end{cases}$$

(d) Lebesgue measure on $\mathbb{R}^n$ is the unique measure on $\mathbb{R}^n$ (equipped with
its Borel $\sigma$-algebra $\mathscr{B}(\mathbb{R}^n)$, generated by the Euclidean open balls) that
assigns to every rectangle its $n$-dimensional volume in the ordinary sense.
To be more precise, Lebesgue measure is actually defined on the
completion $\mathscr{B}_0(\mathbb{R}^n)$ of $\mathscr{B}(\mathbb{R}^n)$, which is a larger $\sigma$-algebra than $\mathscr{B}(\mathbb{R}^n)$. The
rigorous construction of Lebesgue measure is a non-trivial undertaking.

(e) Signed measures/charges arise naturally in the modelling of distributions
with positive and negative values, e.g. $\mu(E)$ = the net electrical charge
within some measurable region $E \subseteq \mathbb{R}^3$. They also arise naturally as
differences of non-negative measures: see Theorem 2.24 later on.
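
Several of these examples can be tried out on a finite set, where every subset is measurable and a measure is determined by its values on singletons. The sketch below is illustrative only: the helper names `dirac`, `counting` and `weighted_sum`, the three-point space and the particular weights are assumptions made for the example, and infinite sets are simply not representable in it.

```python
def dirac(a):
    """Unit Dirac measure at a: E -> 1 if a is in E, else 0."""
    return lambda E: 1 if a in E else 0


def counting(E):
    """Counting measure of a finite set E."""
    return len(E)


def weighted_sum(weights):
    """The set function E -> sum of w * delta_a(E) over the pairs (a, w).

    Non-negative weights give a measure, weights summing to 1 give a
    probability measure, and weights of mixed sign give a signed measure.
    """
    return lambda E: sum(w * dirac(a)(E) for a, w in weights.items())


X = {"a", "b", "c"}
prob = weighted_sum({"a": 0.5, "b": 0.25, "c": 0.25})  # probability measure on X
charge = weighted_sum({"a": 2.0, "b": -3.0})           # signed measure (net charge)
E1, E2 = {"a"}, {"b", "c"}                             # disjoint, with union X

# Additivity on disjoint sets, and normalization of the probability measure.
additive = prob(E1 | E2) == prob(E1) + prob(E2)
total_mass = prob(X)
net_charge = charge({"a", "b"})
```

Here `charge` is the difference of the two non-negative measures $2\delta_a$ and $3\delta_b$, in the spirit of part (e).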

Remark 2.5. Probability theorists usually denote the sample space of a
probability space by $\Omega$; PDE theorists often use the same letter to denote a
domain in $\mathbb{R}^n$ on which a partial differential equation is to be solved. In UQ,
where the worlds of probability and PDE theory often collide, the possibility
of confusion is clear. Therefore, this book will tend to use $\Theta$ for a probability
space and $\mathcal{X}$ for a more general measurable space, which may happen to be
the spatial domain for some PDE.

Definition 2.6. Let $(\mathcal{X}, \mathscr{F}, \mu)$ be a measure space.

(a) If $N \subseteq \mathcal{X}$ is a subset of a measurable set $E \in \mathscr{F}$ such that $\mu(E) = 0$,
then $N$ is called a $\mu$-null set.

(b) If the set of $x \in \mathcal{X}$ for which some property $P(x)$ does not hold is $\mu$-null,
then $P$ is said to hold $\mu$-almost everywhere (or, when $\mu$ is a probability
measure, $\mu$-almost surely).

(c) If every $\mu$-null set is in fact an $\mathscr{F}$-measurable set, then the measure space
$(\mathcal{X}, \mathscr{F}, \mu)$ is said to be complete.

Example 2.7. Let $(\mathcal{X}, \mathscr{F}, \mu)$ be a measure space, and let $f\colon \mathcal{X} \to \mathbb{R}$ be some
function. If $f(x) \geq t$ for $\mu$-almost every $x \in \mathcal{X}$, then $t$ is an essential lower
bound for $f$; the greatest such $t$ is called the essential infimum of $f$:

$$\operatorname{ess\,inf} f := \sup \{ t \in \mathbb{R} \mid f \geq t \ \mu\text{-almost everywhere} \}.$$

Similarly, if $f(x) \leq t$ for $\mu$-almost every $x \in \mathcal{X}$, then $t$ is an essential upper
bound for $f$; the least such $t$ is called the essential supremum of $f$:

$$\operatorname{ess\,sup} f := \inf \{ t \in \mathbb{R} \mid f \leq t \ \mu\text{-almost everywhere} \}.$$

It is so common in measure and probability theory to need to refer to
the set of all points $x \in \mathcal{X}$ such that some property $P(x)$ holds true that
an abbreviated notation has been adopted: simply $[P]$. Thus, for example, if
$f \colon \mathcal{X} \to \mathbb{R}$ is some function, then

$$ [f \leq t] := \{ x \in \mathcal{X} \mid f(x) \leq t \} . $$

As noted above, when the sample space is a topological space, it is usual
to use the Borel $\sigma$-algebra (i.e. the smallest $\sigma$-algebra that contains all the
open sets); measures on the Borel $\sigma$-algebra are called Borel measures. Unless
noted otherwise, this is the convention followed here.



Fig. 2.1: The probability simplex $\mathcal{M}_1(\{1, 2, 3\})$, drawn as the triangle spanned
by the unit Dirac masses $\delta_i$, $i \in \{1, 2, 3\}$, in the vector space of signed
measures on $\{1, 2, 3\}$.


Definition 2.8. The support of a measure $\mu$ defined on a topological space
$\mathcal{X}$ is

$$ \operatorname{supp}(\mu) := \bigcap \{ F \subseteq \mathcal{X} \mid F \text{ is closed and } \mu(\mathcal{X} \setminus F) = 0 \} . $$

That is, $\operatorname{supp}(\mu)$ is the smallest closed subset of $\mathcal{X}$ that has full $\mu$-measure.
Equivalently, $\operatorname{supp}(\mu)$ is the complement of the union of all open sets of
$\mu$-measure zero, or the set of all points $x \in \mathcal{X}$ for which every neighbourhood
of $x$ has strictly positive $\mu$-measure.

Especially in Chapter 14, we shall need to consider the set of all probability
measures defined on a measurable space. $\mathcal{M}_1(\mathcal{X})$ is often called the probability
simplex on $\mathcal{X}$. The motivation for this terminology comes from the case in
which $\mathcal{X} = \{1, \dots, n\}$ is a finite set equipped with the power set $\sigma$-algebra,
which is the same as the Borel $\sigma$-algebra for the discrete topology on $\mathcal{X}$.$^1$ In
this case, functions $f \colon \mathcal{X} \to \mathbb{R}$ are in bijection with column vectors

$$ \begin{bmatrix} f(1) \\ \vdots \\ f(n) \end{bmatrix} $$

and probability measures $\mu$ on the power set of $\mathcal{X}$ are in bijection with the
$(n - 1)$-dimensional set of row vectors

$$ \begin{bmatrix} \mu(\{1\}) & \cdots & \mu(\{n\}) \end{bmatrix} $$

such that $\mu(\{i\}) \geq 0$ for all $i \in \{1, \dots, n\}$ and $\sum_{i=1}^{n} \mu(\{i\}) = 1$. As illustrated
in Figure 2.1, the set of such $\mu$ is the $(n - 1)$-dimensional simplex in $\mathbb{R}^n$ that
is the convex hull of the $n$ points $\delta_1, \dots, \delta_n$,

$$ \delta_i = \begin{bmatrix} 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \end{bmatrix} , $$

with 1 in the $i$th column. Looking ahead, the expected value of $f$ under $\mu$
(to be defined properly in Section 2.3) is exactly the matrix product:

$$ \mathbb{E}_\mu[f] = \sum_{i=1}^{n} \mu(\{i\}) f(i) = \langle \mu \mid f \rangle = \begin{bmatrix} \mu(\{1\}) & \cdots & \mu(\{n\}) \end{bmatrix} \begin{bmatrix} f(1) \\ \vdots \\ f(n) \end{bmatrix} . $$

$^1$ It is an entertaining exercise to see what pathological properties can hold for a probability
measure on a $\sigma$-algebra other than the power set of a finite set $\mathcal{X}$.
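As a minimal numerical sketch of this matrix-product picture, the following Python snippet stores a probability measure on $\{1, 2, 3\}$ as a vector of point masses and a function as a vector of values. The particular weights and function values are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Hypothetical probability measure on X = {1, 2, 3}: mu[i-1] = mu({i}).
mu = np.array([0.2, 0.5, 0.3])
# Hypothetical function f: X -> R, stored as the vector (f(1), f(2), f(3)).
f = np.array([1.0, -2.0, 4.0])

# mu must lie in the probability simplex: non-negative entries summing to 1.
assert np.all(mu >= 0) and np.isclose(mu.sum(), 1.0)

# Expected value of f under mu as the product <mu | f>.
expectation = float(mu @ f)

# The unit Dirac masses are the rows of the identity matrix, and mu is the
# convex combination of them with weights mu({i}).
diracs = np.eye(3)
reconstructed = mu @ diracs
```

Here `expectation` is the weighted sum of the values of `f`, and `reconstructed` recovers `mu` as a convex combination of the vertices of the simplex.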


It is useful to keep in mind this geometric picture of $\mathcal{M}_1(\mathcal{X})$ in addition to the
algebraic and analytical properties of any given $\mu \in \mathcal{M}_1(\mathcal{X})$. As poetically
highlighted by Sir Michael Atiyah (2004, Paper 160, p. 7):

“Algebra is the offer made by the devil to the mathematician. The devil says: ‘I 
will give you this powerful machine, it will answer any question you like. All you 
need to do is give me your soul: give up geometry and you will have this marvellous 
machine.’ ” 


Or, as is traditionally but perhaps apocryphally said to have been inscribed 
over the entrance to Plato’s Academy: 

ΑΓΕΩΜΕΤΡΗΤΟΣ ΜΗΔΕΙΣ ΕΙΣΙΤΩ

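In a sense that will be made precise in Chapter 14, for any ‘nice’ space
$\mathcal{X}$, $\mathcal{M}_1(\mathcal{X})$ is the simplex spanned by the collection of unit Dirac measures
$\{\delta_x \mid x \in \mathcal{X}\}$. Given a bounded, measurable function $f \colon \mathcal{X} \to \mathbb{R}$ and $c \in \mathbb{R}$,

$$ \{ \mu \in \mathcal{M}_1(\mathcal{X}) \mid \mathbb{E}_\mu[f] \leq c \} $$

is a half-space of $\mathcal{M}_1(\mathcal{X})$, and so a set of the form

$$ \{ \mu \in \mathcal{M}_1(\mathcal{X}) \mid \mathbb{E}_\mu[f_1] \leq c_1, \dots, \mathbb{E}_\mu[f_m] \leq c_m \} $$

can be thought of as a polytope of probability measures.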

One operation on probability measures that must frequently be performed
in UQ applications is conditioning, i.e. forming a new probability measure
$\mu(\,\cdot \mid B)$ out of an old one $\mu$ by restricting attention to subsets of a measurable
set $B$. Conditioning is the operation of supposing that $B$ has happened,
and examining the consequently updated probabilities for other measurable
events.

Definition 2.9. If $(\Theta, \mathscr{F}, \mu)$ is a probability space and $B \in \mathscr{F}$ has $\mu(B) > 0$,
then the conditional probability measure $\mu(\,\cdot \mid B)$ on $(\Theta, \mathscr{F})$ is defined by

$$ \mu(E \mid B) := \frac{\mu(E \cap B)}{\mu(B)} \quad \text{for } E \in \mathscr{F} . $$


The following theorem on conditional probabilities is fundamental to
subjective (Bayesian) probability and statistics (q.v. Section 2.8):

Theorem 2.10 (Bayes’ rule). If $(\Theta, \mathscr{F}, \mu)$ is a probability space and $A, B \in \mathscr{F}$
have $\mu(A), \mu(B) > 0$, then

$$ \mu(A \mid B) = \frac{\mu(B \mid A) \, \mu(A)}{\mu(B)} . $$

Both the definition of conditional probability and Bayes’ rule can be
extended to much more general contexts (including cases in which $\mu(B) = 0$)
using advanced tools such as regular conditional probabilities and the
disintegration theorem. In Bayesian settings, $\mu(A)$ represents the ‘prior’ probability
of some event $A$, and $\mu(A \mid B)$ its ‘posterior’ probability, having observed some
additional data $B$.
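The definition of conditional probability and Bayes' rule can be checked by hand on a finite probability space. The following Python sketch does so for two fair dice with the uniform measure; the sample space, the events `A` and `B`, and the helper names `mu` and `conditional` are all hypothetical choices made for illustration.

```python
from fractions import Fraction

# Hypothetical finite probability space: two fair dice, uniform measure.
theta = [(i, j) for i in range(1, 7) for j in range(1, 7)]
mu_point = Fraction(1, len(theta))

def mu(event):
    """Probability of an event, given as a predicate on outcomes."""
    return sum(mu_point for w in theta if event(w))

def conditional(event, given):
    """mu(E | B) = mu(E and B) / mu(B), defined only when mu(B) > 0."""
    mu_b = mu(given)
    if mu_b == 0:
        raise ValueError("conditioning event has probability zero")
    return mu(lambda w: event(w) and given(w)) / mu_b

A = lambda w: w[0] + w[1] == 8      # the sum of the dice is 8
B = lambda w: w[0] == 3             # the first die shows 3

posterior = conditional(A, B)                  # mu(A | B), directly
bayes = conditional(B, A) * mu(A) / mu(B)      # right-hand side of Bayes' rule
```

Exact rational arithmetic is used so that the two sides can be compared with `==` rather than up to rounding error. Note that `conditional` refuses to condition on a null event, in line with the requirement $\mu(B) > 0$ in the elementary definition.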



2.2 Random Variables and Stochastic Processes 

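Definition 2.11. Let $(\mathcal{X}, \mathscr{F})$ and $(\mathcal{Y}, \mathscr{G})$ be measurable spaces. A function
$f \colon \mathcal{X} \to \mathcal{Y}$ generates a $\sigma$-algebra on $\mathcal{X}$ by

$$ \sigma(f) := \sigma\bigl( \{ [f \in E] \mid E \in \mathscr{G} \} \bigr) , $$

and $f$ is called a measurable function if $\sigma(f) \subseteq \mathscr{F}$. That is, $f$ is measurable
if the pre-image $f^{-1}(E)$ of every $\mathscr{G}$-measurable subset $E$ of $\mathcal{Y}$ is an
$\mathscr{F}$-measurable subset of $\mathcal{X}$. A measurable function whose domain is a
probability space is usually called a random variable.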

Remark 2.12. Note that if $\mathscr{F}$ is the power set of $\mathcal{X}$, or if $\mathscr{G}$ is the trivial
$\sigma$-algebra $\{\varnothing, \mathcal{Y}\}$, then every function $f \colon \mathcal{X} \to \mathcal{Y}$ is measurable. At the
opposite extreme, if $\mathscr{F}$ is the trivial $\sigma$-algebra $\{\varnothing, \mathcal{X}\}$, then the only measurable
functions $f \colon \mathcal{X} \to \mathcal{Y}$ are the constant functions. Thus, in some sense, the
sizes of the $\sigma$-algebras used to define measurability provide a notion of how
well- or ill-behaved the measurable functions are.

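Definition 2.13. A measurable function $f \colon \mathcal{X} \to \mathcal{Y}$ from a measure space
$(\mathcal{X}, \mathscr{F}, \mu)$ to a measurable space $(\mathcal{Y}, \mathscr{G})$ defines a measure $f_* \mu$ on $(\mathcal{Y}, \mathscr{G})$,
called the push-forward of $\mu$ by $f$, by

$$ (f_* \mu)(E) := \mu([f \in E]) , \quad \text{for } E \in \mathscr{G} . $$

When $\mu$ is a probability measure, $f_* \mu$ is called the distribution or law of the
random variable $f$.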
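For a measure supported on finitely many points, the push-forward can be computed by accumulating the mass of each point at its image. The sketch below assumes such a discrete measure, stored as a dictionary from points to masses; the function name `push_forward` and the example are illustrative only.

```python
from collections import defaultdict
from fractions import Fraction

def push_forward(mu, f):
    """Push a discrete measure mu (dict: point -> mass) forward through f.

    (f_* mu)({y}) = mu({x : f(x) = y}).
    """
    law = defaultdict(Fraction)          # Fraction() is exactly zero
    for x, mass in mu.items():
        law[f(x)] += mass
    return dict(law)

# Hypothetical example: uniform measure on {-2, ..., 2}, pushed forward by x -> x^2.
mu = {x: Fraction(1, 5) for x in range(-2, 3)}
law = push_forward(mu, lambda x: x * x)
```

The points $\pm 1$ and $\pm 2$ are merged by the map, so the law of $x \mapsto x^2$ puts twice the mass on 1 and 4 that it puts on 0, while the total mass is preserved.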


Definition 2.14. Let $S$ be any set and let $(\Theta, \mathscr{F}, \mu)$ be a probability space.
A function $U \colon S \times \Theta \to \mathcal{X}$ such that each $U(s, \cdot)$ is a random variable is
called an $\mathcal{X}$-valued stochastic process on $S$.

Whereas measurability questions for a single random variable are discussed
in terms of a single $\sigma$-algebra, measurability questions for stochastic processes
are discussed in terms of families of $\sigma$-algebras; when the indexing set $S$ is
linearly ordered, e.g. by the natural numbers, or by a continuous parameter
such as time, these families of $\sigma$-algebras are increasing in the following sense:

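Definition 2.15. (a) A filtration of a $\sigma$-algebra $\mathscr{F}$ is a family $\mathscr{F}_\bullet = \{ \mathscr{F}_i \mid i \in I \}$
of sub-$\sigma$-algebras of $\mathscr{F}$, indexed by an ordered set $I$, such that

$$ i \leq j \text{ in } I \implies \mathscr{F}_i \subseteq \mathscr{F}_j . $$

(b) The natural filtration associated with a stochastic process $U \colon I \times \Theta \to \mathcal{X}$
is the filtration $\mathscr{F}^U_\bullet$ defined by

$$ \mathscr{F}^U_i := \sigma\bigl( \{ U(j, \cdot)^{-1}(E) \subseteq \Theta \mid E \subseteq \mathcal{X} \text{ is measurable and } j \leq i \} \bigr) . $$

(c) A stochastic process $U$ is adapted to a filtration $\mathscr{F}_\bullet$ if $\mathscr{F}^U_i \subseteq \mathscr{F}_i$ for
each $i \in I$.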

Measurability and adaptedness are important properties of stochastic
processes, and loosely correspond to certain questions being ‘answerable’ or
‘decidable’ with respect to the information contained in a given $\sigma$-algebra. For
instance, if the event $[X \in E]$ is not $\mathscr{F}$-measurable, then it does not even
make sense to ask about the probability $\mathbb{P}_\mu[X \in E]$. For another example,
suppose that some stream of observed data is modelled as a stochastic
process $Y$, and it is necessary to make some decision $U(t)$ at each time $t$. It is
common sense to require that the decision stochastic process be $\mathscr{F}^Y_\bullet$-adapted,
since the decision $U(t)$ must be made on the basis of the observations $Y(s)$,
$s \leq t$, not on observations from any future time.


2.3 Lebesgue Integration 

Integration of a measurable function with respect to a (signed or
non-negative) measure is referred to as Lebesgue integration. Despite the many
technical details that must be checked in the construction of the Lebesgue
integral, it remains the integral of choice for most mathematical and
probabilistic applications because it extends the simple Riemann integral of functions
of a single real variable, can handle worse singularities than the Riemann
integral, has better convergence properties, and also naturally captures the
notion of an expected value in probability theory. The issue of numerical
evaluation of integrals (a vital one in UQ applications) will be addressed
separately in Chapter 9.

The construction of the Lebesgue integral is accomplished in three steps:
first, the integral is defined for simple functions, which are analogous to step
functions from elementary calculus, except that their plateaus are not
intervals in $\mathbb{R}$ but measurable events in the sample space.

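Definition 2.16. Let $(\mathcal{X}, \mathscr{F}, \mu)$ be a measure space. The indicator function
$\mathbb{I}_E$ of a set $E \in \mathscr{F}$ is the measurable function defined by

$$ \mathbb{I}_E(x) := \begin{cases} 1, & \text{if } x \in E, \\ 0, & \text{if } x \notin E. \end{cases} $$

A function $f \colon \mathcal{X} \to \mathbb{K}$ is called simple if

$$ f = \sum_{i=1}^{n} \alpha_i \mathbb{I}_{E_i} $$

for some scalars $\alpha_1, \dots, \alpha_n \in \mathbb{K}$ and some pairwise disjoint measurable sets
$E_1, \dots, E_n \in \mathscr{F}$ with $\mu(E_i)$ finite for $i = 1, \dots, n$. The Lebesgue integral of a
simple function $f := \sum_{i=1}^{n} \alpha_i \mathbb{I}_{E_i}$ is defined to be

$$ \int_{\mathcal{X}} f \, \mathrm{d}\mu := \sum_{i=1}^{n} \alpha_i \mu(E_i) . $$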


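In the second step, the integral of a non-negative measurable function is
defined through approximation from below by the integrals of simple
functions:

Definition 2.17. Let $(\mathcal{X}, \mathscr{F}, \mu)$ be a measure space and let $f \colon \mathcal{X} \to [0, +\infty]$
be a measurable function. The Lebesgue integral of $f$ is defined to be

$$ \int_{\mathcal{X}} f \, \mathrm{d}\mu := \sup \left\{ \int_{\mathcal{X}} \phi \, \mathrm{d}\mu \,\middle|\, \begin{array}{l} \phi \colon \mathcal{X} \to \mathbb{R} \text{ is a simple function, and} \\ 0 \leq \phi(x) \leq f(x) \text{ for } \mu\text{-almost all } x \in \mathcal{X} \end{array} \right\} . $$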

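Finally, the integral of a real- or complex-valued function is defined through
integration of positive and negative real and imaginary parts, with care being
taken to avoid the undefined expression ‘$\infty - \infty$’:

Definition 2.18. Let $(\mathcal{X}, \mathscr{F}, \mu)$ be a measure space and let $f \colon \mathcal{X} \to \mathbb{R}$ be a
measurable function. The Lebesgue integral of $f$ is defined to be

$$ \int_{\mathcal{X}} f \, \mathrm{d}\mu := \int_{\mathcal{X}} f_+ \, \mathrm{d}\mu - \int_{\mathcal{X}} f_- \, \mathrm{d}\mu $$

provided that at least one of the integrals on the right-hand side is finite. The
integral of a complex-valued measurable function $f \colon \mathcal{X} \to \mathbb{C}$ is defined to be

$$ \int_{\mathcal{X}} f \, \mathrm{d}\mu := \int_{\mathcal{X}} (\operatorname{Re} f) \, \mathrm{d}\mu + i \int_{\mathcal{X}} (\operatorname{Im} f) \, \mathrm{d}\mu . $$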


The Lebesgue integral satisfies all the natural requirements for a useful
notion of integration: integration is a linear function of the integrand,
integrals are additive over disjoint domains of integration, and in the case $\mathcal{X} = \mathbb{R}$
every Riemann-integrable function is Lebesgue integrable. However, one of
the chief attractions of the Lebesgue integral over other notions of integration
is that, subject to a simple domination condition, pointwise convergence of
integrands is enough to ensure convergence of integral values:

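Theorem 2.19 (Dominated convergence theorem). Let $(\mathcal{X}, \mathscr{F}, \mu)$ be a
measure space and let $f_n \colon \mathcal{X} \to \mathbb{K}$ be a measurable function for each $n \in \mathbb{N}$. If
$f \colon \mathcal{X} \to \mathbb{K}$ is such that $\lim_{n \to \infty} f_n(x) = f(x)$ for every $x \in \mathcal{X}$ and there
is a measurable function $g \colon \mathcal{X} \to [0, \infty]$ such that $\int_{\mathcal{X}} |g| \, \mathrm{d}\mu$ is finite and
$|f_n(x)| \leq g(x)$ for all $x \in \mathcal{X}$ and all large enough $n \in \mathbb{N}$, then

$$ \int_{\mathcal{X}} f \, \mathrm{d}\mu = \lim_{n \to \infty} \int_{\mathcal{X}} f_n \, \mathrm{d}\mu . $$

Furthermore, if the measure space is complete, then the conditions on
pointwise convergence and pointwise domination of $f_n(x)$ can be relaxed to hold
$\mu$-almost everywhere.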

As alluded to earlier, the Lebesgue integral is the standard one in proba- 
bility theory, and is used to define the mean or expected value of a random 
variable: 

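Definition 2.20. When $(\Theta, \mathscr{F}, \mu)$ is a probability space and $X \colon \Theta \to \mathbb{K}$ is
a random variable, it is conventional to write $\mathbb{E}_\mu[X]$ for $\int_\Theta X(\theta) \, \mathrm{d}\mu(\theta)$ and
to call $\mathbb{E}_\mu[X]$ the expected value or expectation of $X$. Also,

$$ \mathbb{V}_\mu[X] := \mathbb{E}_\mu\bigl[ \bigl| X - \mathbb{E}_\mu[X] \bigr|^2 \bigr] \equiv \mathbb{E}_\mu\bigl[ |X|^2 \bigr] - \bigl| \mathbb{E}_\mu[X] \bigr|^2 $$

is called the variance of $X$. If $X$ is a $\mathbb{K}^d$-valued random variable, then $\mathbb{E}_\mu[X]$,
if it exists, is an element of $\mathbb{K}^d$, and

$$ C := \mathbb{E}_\mu\bigl[ (X - \mathbb{E}_\mu[X]) (X - \mathbb{E}_\mu[X])^* \bigr] \in \mathbb{K}^{d \times d} , $$

$$ \text{i.e. } C_{ij} := \mathbb{E}_\mu\bigl[ (X_i - \mathbb{E}_\mu[X_i]) \overline{(X_j - \mathbb{E}_\mu[X_j])} \bigr] \in \mathbb{K} , $$

is the covariance matrix of $X$.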

Spaces of Lebesgue-integrable functions are ubiquitous in analysis and 
probability theory: 

Definition 2.21. Let $(\mathcal{X}, \mathscr{F}, \mu)$ be a measure space. For $1 \leq p \leq \infty$, the $L^p$
space (or Lebesgue space) is defined by

$$ L^p(\mathcal{X}, \mu; \mathbb{K}) := \{ f \colon \mathcal{X} \to \mathbb{K} \mid f \text{ is measurable and } \|f\|_{L^p(\mu)} \text{ is finite} \} . $$

For $1 \leq p < \infty$, the norm is defined by the integral expression

$$ \|f\|_{L^p(\mu)} := \left( \int_{\mathcal{X}} |f(x)|^p \, \mathrm{d}\mu(x) \right)^{1/p} ; \qquad (2.1) $$

for $p = \infty$, the norm is defined by the essential supremum (cf. Example 2.7)

$$ \|f\|_{L^\infty(\mu)} := \operatorname*{ess\,sup}_{x \in \mathcal{X}} |f(x)| \qquad (2.2) $$
$$ = \inf \{ \|g\|_\infty \mid f = g \colon \mathcal{X} \to \mathbb{K} \ \mu\text{-almost everywhere} \} $$
$$ = \inf \{ t \geq 0 \mid |f| \leq t \ \mu\text{-almost everywhere} \} . $$

To be more precise, $L^p(\mathcal{X}, \mu; \mathbb{K})$ is the set of equivalence classes of such
functions, where functions that differ only on a set of $\mu$-measure zero are identified.

When $(\Theta, \mathscr{F}, \mu)$ is a probability space, we have the containments

$$ 1 \leq p \leq q \leq \infty \implies L^p(\Theta, \mu; \mathbb{R}) \supseteq L^q(\Theta, \mu; \mathbb{R}) . $$

Thus, random variables in higher-order Lebesgue spaces are ‘better behaved’
than those in lower-order ones. As a simple example of this slogan, the
following inequality shows that the $L^p$-norm of a random variable $X$ provides
control on the probability that $X$ deviates strongly from its mean value:

Theorem 2.22 (Chebyshev’s inequality). Let $X \in L^p(\Theta, \mu; \mathbb{K})$, $1 \leq p < \infty$,
be a random variable. Then, for all $t > 0$,

$$ \mathbb{P}_\mu\bigl[ \bigl| X - \mathbb{E}_\mu[X] \bigr| \geq t \bigr] \leq t^{-p} \, \mathbb{E}_\mu\bigl[ |X|^p \bigr] . \qquad (2.3) $$

(The case $p = 1$ is also known as Markov’s inequality.) It is natural to ask
if (2.3) is the best inequality of this type given the stated assumptions on $X$,
and this is a question that will be addressed in Chapter 14, and specifically
Example 14.18.
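For $p = 2$, the tail bound can be checked exactly on a finite probability space. The sketch below uses a hypothetical four-point distribution and compares the exact tail probability with two bounds: $t^{-2} \mathbb{E}[|X|^2]$, and the sharper centred bound $t^{-2} \mathbb{E}[|X - \mathbb{E}[X]|^2]$ obtained by applying Markov's inequality to $|X - \mathbb{E}[X]|^2$. The values, probabilities, and thresholds are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical discrete random variable X with values x_i and probabilities p_i.
values = np.array([-1.0, 0.0, 2.0, 5.0])
probs = np.array([0.3, 0.4, 0.2, 0.1])

mean = float(probs @ values)
second_moment = float(probs @ values**2)          # E[|X|^2]
variance = float(probs @ (values - mean) ** 2)    # E[|X - E[X]|^2]

def tail_probability(t):
    """P[|X - E[X]| >= t], computed exactly from the probability vector."""
    return float(probs[np.abs(values - mean) >= t].sum())

thresholds = np.linspace(0.5, 5.0, 10)
tails = np.array([tail_probability(t) for t in thresholds])
bound_centred = variance / thresholds**2      # t^-2 * E|X - E[X]|^2
bound_raw = second_moment / thresholds**2     # t^-2 * E|X|^2
```

For small $t$ both bounds exceed 1 and are uninformative; they only carry information in the tails, which is typical of moment-based inequalities of this type.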

Integration of Vector-Valued Functions. Lebesgue integration of
functions that take values in $\mathbb{R}^n$ can be handled componentwise, as indeed was
done above for complex-valued integrands. However, many UQ problems
concern random fields, i.e. random variables with values in infinite-dimensional
spaces of functions. For definiteness, consider a function $f$ defined on a
measure space $(\mathcal{X}, \mathscr{F}, \mu)$ taking values in a Banach space $\mathcal{V}$. There are two ways
to proceed, and they are in general inequivalent:

(a) The strong integral or Bochner integral of $f$ is defined by integrating
simple $\mathcal{V}$-valued functions as in the construction of the Lebesgue integral,
and then defining

$$ \int_{\mathcal{X}} f \, \mathrm{d}\mu := \lim_{n \to \infty} \int_{\mathcal{X}} \phi_n \, \mathrm{d}\mu $$

whenever $(\phi_n)_{n \in \mathbb{N}}$ is a sequence of simple functions such that the
(scalar-valued) Lebesgue integral $\int_{\mathcal{X}} \|f - \phi_n\| \, \mathrm{d}\mu$ converges to 0 as $n \to \infty$.
It transpires that $f$ is Bochner integrable if and only if $\|f\|$ is Lebesgue
integrable. The Bochner integral satisfies a version of the Dominated
Convergence Theorem, but there are some subtleties concerning the
Radon-Nikodym theorem.

(b) The weak integral or Pettis integral of $f$ is defined using duality: $\int_{\mathcal{X}} f \, \mathrm{d}\mu$
is defined to be an element $v \in \mathcal{V}$ such that

$$ \langle \ell \mid v \rangle = \int_{\mathcal{X}} \langle \ell \mid f(x) \rangle \, \mathrm{d}\mu(x) \quad \text{for all } \ell \in \mathcal{V}' . $$

Since this is a weaker integrability criterion, there are naturally more
Pettis-integrable functions than Bochner-integrable ones, but the Pettis
integral has deficiencies such as the space of Pettis-integrable functions
being incomplete, the existence of a Pettis-integrable function $f \colon [0, 1] \to \mathcal{V}$
such that $F(t) := \int_{[0, t]} f(\tau) \, \mathrm{d}\tau$ is not differentiable (Kadets, 1994), and
so on.


2.4 Decomposition and Total Variation of Signed 
Measures 

If a good mental model for a non-negative measure is a distribution of mass, 
then a good mental model for a signed measure is a distribution of electrical 
charge. A natural question to ask is whether every distribution of charge can 
be decomposed into regions of purely positive and purely negative charge, and 
hence whether it can be written as the difference of two non-negative distri- 
butions, with one supported entirely on the positive set and the other on the 
negative set. The answer is provided by the Hahn and Jordan decomposition 
theorems. 

Definition 2.23. Two non-negative measures $\mu$ and $\nu$ on a measurable space
$(\mathcal{X}, \mathscr{F})$ are said to be mutually singular, denoted $\mu \perp \nu$, if there exists $E \in \mathscr{F}$
such that $\mu(E) = \nu(\mathcal{X} \setminus E) = 0$.

Theorem 2.24 (Hahn-Jordan decomposition). Let $\mu$ be a signed measure
on a measurable space $(\mathcal{X}, \mathscr{F})$.

(a) Hahn decomposition: there exist sets $P, N \in \mathscr{F}$ such that $P \cup N = \mathcal{X}$,
$P \cap N = \varnothing$, and

for all measurable $E \subseteq P$, $\mu(E) \geq 0$,
for all measurable $E \subseteq N$, $\mu(E) \leq 0$.

This decomposition is essentially unique in the sense that if $P'$ and $N'$
also satisfy these conditions, then every measurable subset of the
symmetric differences $P \mathbin{\triangle} P'$ and $N \mathbin{\triangle} N'$ is of $\mu$-measure zero.

(b) Jordan decomposition: there are unique mutually singular non-negative
measures $\mu_+$ and $\mu_-$ on $(\mathcal{X}, \mathscr{F})$, at least one of which is a finite measure,
such that $\mu = \mu_+ - \mu_-$; indeed, for all $E \in \mathscr{F}$,

$$ \mu_+(E) = \mu(E \cap P) , $$
$$ \mu_-(E) = -\mu(E \cap N) . $$

From a probabilistic perspective, the main importance of signed measures 
and their Hahn and Jordan decompositions is that they provide a useful 
notion of distance between probability measures: 

Definition 2.25. Let $\mu$ be a signed measure on a measurable space $(\mathcal{X}, \mathscr{F})$,
with Jordan decomposition $\mu = \mu_+ - \mu_-$. The associated total variation
measure is the non-negative measure $|\mu| := \mu_+ + \mu_-$. The total variation of
$\mu$ is $\|\mu\|_{\mathrm{TV}} := |\mu|(\mathcal{X})$.

Remark 2.26. (a) As the notation $\|\mu\|_{\mathrm{TV}}$ suggests, $\| \cdot \|_{\mathrm{TV}}$ is a norm on the
space $\mathcal{M}_\pm(\mathcal{X}, \mathscr{F})$ of signed measures on $(\mathcal{X}, \mathscr{F})$.

(b) The total variation measure can be equivalently defined using measurable
partitions:

$$ |\mu|(E) = \sup \left\{ \sum_{i=1}^{n} |\mu(E_i)| \,\middle|\, \begin{array}{l} n \in \mathbb{N}_0, \ E_1, \dots, E_n \in \mathscr{F}, \\ \text{and } E = E_1 \cup \dots \cup E_n \end{array} \right\} . $$

(c) The total variation distance between two probability measures $\mu$ and $\nu$
(i.e. the total variation norm of their difference) can thus be characterized as

$$ d_{\mathrm{TV}}(\mu, \nu) \equiv \|\mu - \nu\|_{\mathrm{TV}} = 2 \sup \{ |\mu(E) - \nu(E)| \mid E \in \mathscr{F} \} , \qquad (2.4) $$

i.e. twice the greatest absolute difference in the two probability values
that $\mu$ and $\nu$ assign to any measurable event $E$.
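On a finite set, both the Jordan decomposition and the characterization of the total variation distance as twice the largest discrepancy over events can be verified by brute force. The following sketch assumes two hypothetical probability vectors on a four-point space and enumerates all $2^4$ events.

```python
from itertools import combinations
import numpy as np

# Hypothetical probability measures on a four-point space.
mu = np.array([0.1, 0.4, 0.3, 0.2])
nu = np.array([0.3, 0.3, 0.1, 0.3])

signed = mu - nu                        # point masses of the signed measure mu - nu
# Jordan decomposition: positive and negative parts of the point masses.
plus = np.clip(signed, 0.0, None)
minus = np.clip(-signed, 0.0, None)
tv_norm = float((plus + minus).sum())   # |mu - nu|(X)

# Brute-force supremum over all events E of |mu(E) - nu(E)|.
points = range(len(mu))
sup_gap = max(
    abs(sum(signed[i] for i in E))
    for r in range(len(mu) + 1)
    for E in combinations(points, r)
)
```

The supremum is attained on the positive set $P$ of the Hahn decomposition (here, the points where `signed` is positive), which explains the factor of 2: for probability measures, the positive and negative parts of $\mu - \nu$ have equal total mass.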


2.5 The Radon-Nikodym Theorem and Densities

Let $(\mathcal{X}, \mathscr{F}, \mu)$ be a measure space and let $\rho \colon \mathcal{X} \to [0, +\infty]$ be a measurable
function. The operation

$$ \nu \colon E \mapsto \int_E \rho(x) \, \mathrm{d}\mu(x) \qquad (2.5) $$

defines a measure $\nu$ on $(\mathcal{X}, \mathscr{F})$. It is natural to ask whether every measure
$\nu$ on $(\mathcal{X}, \mathscr{F})$ can be expressed in this way. A moment’s thought reveals that
the answer, in general, is no: there is no such function $\rho$ that will make (2.5)
hold when $\mu$ and $\nu$ are Lebesgue measure and a unit Dirac measure (or vice
versa) on $\mathbb{R}$.


Definition 2.27. Let $\mu$ and $\nu$ be measures on a measurable space $(\mathcal{X}, \mathscr{F})$.
If, for $E \in \mathscr{F}$, $\nu(E) = 0$ whenever $\mu(E) = 0$, then $\nu$ is said to be absolutely
continuous with respect to $\mu$, denoted $\nu \ll \mu$. If $\nu \ll \mu \ll \nu$, then $\mu$ and $\nu$
are said to be equivalent, and this is denoted $\mu \sim \nu$.

Definition 2.28. A measure space $(\mathcal{X}, \mathscr{F}, \mu)$ is said to be $\sigma$-finite if $\mathcal{X}$
can be expressed as a countable union of $\mathscr{F}$-measurable sets, each of finite
$\mu$-measure.

Theorem 2.29 (Radon-Nikodym). Suppose that $\mu$ and $\nu$ are $\sigma$-finite
measures on a measurable space $(\mathcal{X}, \mathscr{F})$ and that $\nu \ll \mu$. Then there exists a
measurable function $\rho \colon \mathcal{X} \to [0, \infty]$ such that, for all measurable functions
$f \colon \mathcal{X} \to \mathbb{R}$ and all $E \in \mathscr{F}$,

$$ \int_E f \, \mathrm{d}\nu = \int_E f \rho \, \mathrm{d}\mu $$

whenever either integral exists. Furthermore, any two functions $\rho$ with this
property are equal $\mu$-almost everywhere.

The function $\rho$ in the Radon-Nikodym theorem is called the Radon-Nikodym
derivative of $\nu$ with respect to $\mu$, and the suggestive notation $\rho = \frac{\mathrm{d}\nu}{\mathrm{d}\mu}$
is often used. In probability theory, when $\nu$ is a probability measure, $\frac{\mathrm{d}\nu}{\mathrm{d}\mu}$ is
called the probability density function (PDF) of $\nu$ (or any $\nu$-distributed
random variable) with respect to $\mu$. Radon-Nikodym derivatives behave very
much like the derivatives of elementary calculus:

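Theorem 2.30 (Chain rule). Suppose that $\mu$, $\nu$ and $\pi$ are $\sigma$-finite measures
on a measurable space $(\mathcal{X}, \mathscr{F})$ and that $\pi \ll \nu \ll \mu$. Then $\pi \ll \mu$ and

$$ \frac{\mathrm{d}\pi}{\mathrm{d}\mu} = \frac{\mathrm{d}\pi}{\mathrm{d}\nu} \frac{\mathrm{d}\nu}{\mathrm{d}\mu} \quad \mu\text{-almost everywhere.} $$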

Remark 2.31. The Radon-Nikodym theorem also holds for a signed
measure $\nu$ and a non-negative measure $\mu$, but in this case the absolute continuity
condition is that the total variation measure $|\nu|$ satisfies $|\nu| \ll \mu$, and of
course the density $\rho$ is no longer required to be a non-negative function.
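On a finite set, a Radon-Nikodym derivative is simply a ratio of point masses, defined wherever the reference measure charges a point. The sketch below illustrates the change-of-measure identity and the chain rule for three hypothetical measures on a four-point space; the helper name `density` and all numerical values are assumptions for illustration.

```python
import numpy as np

# Hypothetical measures on a four-point space, given by their point masses.
mu = np.array([0.5, 1.0, 2.0, 0.5])      # reference measure (not normalised)
nu = np.array([0.2, 0.0, 0.6, 0.2])      # nu << mu, since mu has no zero masses
pi = np.array([0.5, 0.0, 0.5, 0.0])      # pi << nu: pi vanishes wherever nu does

def density(numerator, denominator):
    """Radon-Nikodym derivative d(numerator)/d(denominator) on a finite set.

    Where the denominator has zero mass the value is arbitrary (that set is
    null for the denominator), so it is set to 0 here.
    """
    out = np.zeros_like(numerator)
    np.divide(numerator, denominator, out=out, where=denominator > 0)
    return out

rho = density(nu, mu)                    # d nu / d mu
f = np.array([1.0, -3.0, 2.0, 4.0])      # arbitrary test function

integral_nu = float(f @ nu)              # integral of f with respect to nu
integral_mu = float((f * rho) @ mu)      # integral of f * rho with respect to mu

# Chain rule: d pi / d mu = (d pi / d nu) * (d nu / d mu), mu-almost everywhere.
chain_lhs = density(pi, mu)
chain_rhs = density(pi, nu) * rho
```

The arbitrary choice made by `density` on null sets reflects the fact that Radon-Nikodym derivatives are only determined almost everywhere.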


2.6 Product Measures and Independence 

The previous section considered one way of making new measures from old
ones, namely by re-weighting them using a locally integrable density
function. By way of contrast, this section considers another way of making new
measures from old, namely forming a product measure. Geometrically
speaking, the product of two measures is analogous to ‘area’ as the product of
two ‘length’ measures. Products of measures also arise naturally in
probability theory, since they are the distributions of mutually independent random
variables.

Definition 2.32. Let $(\Theta, \mathscr{F}, \mu)$ be a probability space.

(a) Two measurable sets (events) $E_1, E_2 \in \mathscr{F}$ are said to be independent if
$\mu(E_1 \cap E_2) = \mu(E_1) \mu(E_2)$.

(b) Two sub-$\sigma$-algebras $\mathscr{G}_1$ and $\mathscr{G}_2$ of $\mathscr{F}$ are said to be independent if $E_1$ and
$E_2$ are independent events whenever $E_1 \in \mathscr{G}_1$ and $E_2 \in \mathscr{G}_2$.

(c) Two measurable functions (random variables) $X \colon \Theta \to \mathcal{X}$ and $Y \colon \Theta \to \mathcal{Y}$
are said to be independent if the $\sigma$-algebras generated by $X$ and $Y$ are
independent.

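Definition 2.33. Let $(\mathcal{X}, \mathscr{F}, \mu)$ and $(\mathcal{Y}, \mathscr{G}, \nu)$ be $\sigma$-finite measure spaces.
The product $\sigma$-algebra $\mathscr{F} \otimes \mathscr{G}$ is the $\sigma$-algebra on $\mathcal{X} \times \mathcal{Y}$ that is generated
by the measurable rectangles, i.e. the smallest $\sigma$-algebra for which all the
products

$$ F \times G , \quad F \in \mathscr{F}, \ G \in \mathscr{G} , $$

are measurable sets. The product measure $\mu \otimes \nu \colon \mathscr{F} \otimes \mathscr{G} \to [0, +\infty]$ is the
measure such that

$$ (\mu \otimes \nu)(F \times G) = \mu(F) \nu(G) , \quad \text{for all } F \in \mathscr{F}, \ G \in \mathscr{G} . $$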

In the other direction, given a measure on a product space, we can consider 
the measures induced on the factor spaces: 

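Definition 2.34. Let $(\mathcal{X} \times \mathcal{Y}, \mathscr{F}, \mu)$ be a measure space and suppose that
the factor space $\mathcal{X}$ is equipped with a $\sigma$-algebra such that the projection
$\Pi_{\mathcal{X}} \colon (x, y) \mapsto x$ is a measurable function. Then the marginal measure $\mu_{\mathcal{X}}$ is
the measure on $\mathcal{X}$ defined by

$$ \mu_{\mathcal{X}}(E) := \bigl( (\Pi_{\mathcal{X}})_* \mu \bigr)(E) = \mu(E \times \mathcal{Y}) . $$

The marginal measure $\mu_{\mathcal{Y}}$ on $\mathcal{Y}$ is defined similarly.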

Theorem 2.35. Let $X = (X_1, X_2)$ be a random variable taking values in a
product space $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$. Let $\mu$ be the (joint) distribution of $X$, and $\mu_i$ the
(marginal) distribution of $X_i$ for $i = 1, 2$. Then $X_1$ and $X_2$ are independent
random variables if and only if $\mu = \mu_1 \otimes \mu_2$.
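For random variables taking finitely many values, the joint distribution is a table of probabilities, the marginals are its row and column sums, and the product measure is the outer product of the marginals. The following sketch tests the factorization criterion on two hypothetical joint laws; the function names are illustrative.

```python
import numpy as np

def marginals(joint):
    """Marginal distributions of a joint probability table joint[i, j]."""
    return joint.sum(axis=1), joint.sum(axis=0)

def is_independent(joint, tol=1e-12):
    """Check whether the joint law equals the product of its marginals."""
    mu1, mu2 = marginals(joint)
    return bool(np.allclose(joint, np.outer(mu1, mu2), atol=tol))

# Hypothetical joint laws on {0, 1} x {0, 1, 2}.
product_law = np.outer([0.25, 0.75], [0.2, 0.3, 0.5])   # independent by construction
coupled_law = np.array([[0.2, 0.0, 0.3],
                        [0.0, 0.3, 0.2]])                # dependent components
```

The table `coupled_law` has second marginal `(0.2, 0.3, 0.5)`, the same as `product_law`, which illustrates that the marginals alone do not determine the joint distribution unless independence is known.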

The important property of integration with respect to a product measure, 
and hence taking expected values of independent random variables, is that it 
can be performed by iterated integration: 

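Theorem 2.36 (Fubini-Tonelli). Let $(\mathcal{X}, \mathscr{F}, \mu)$ and $(\mathcal{Y}, \mathscr{G}, \nu)$ be $\sigma$-finite
measure spaces, and let $f \colon \mathcal{X} \times \mathcal{Y} \to [0, +\infty]$ be measurable. Then, of the
following three integrals, if one exists in $[0, \infty]$, then all three exist and are
equal:

$$ \int_{\mathcal{X}} \int_{\mathcal{Y}} f(x, y) \, \mathrm{d}\nu(y) \, \mathrm{d}\mu(x) , \qquad \int_{\mathcal{Y}} \int_{\mathcal{X}} f(x, y) \, \mathrm{d}\mu(x) \, \mathrm{d}\nu(y) , $$

$$ \text{and} \quad \int_{\mathcal{X} \times \mathcal{Y}} f(x, y) \, \mathrm{d}(\mu \otimes \nu)(x, y) . $$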


Infinite product measures (or, put another way, infinite sequences of inde- 
pendent random variables) have some interesting extreme properties. Infor- 
mally, the following result says that any property of a sequence of independent 
random variables that is independent of any finite subcollection (i.e. depends 
only on the ‘infinite tail’ of the sequence) must be almost surely true or 
almost surely false: 


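Theorem 2.37 (Kolmogorov zero-one law). Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of
independent random variables defined over a probability space $(\Theta, \mathscr{F}, \mu)$, and
let $\mathscr{F}_n := \sigma(X_n)$. For each $n \in \mathbb{N}$, let $\mathscr{G}_n := \sigma\bigl( \bigcup_{k \geq n} \mathscr{F}_k \bigr)$, and let

$$ \mathscr{T} := \bigcap_{n \in \mathbb{N}} \mathscr{G}_n = \bigcap_{n \in \mathbb{N}} \sigma(X_n, X_{n+1}, \dots) \subseteq \mathscr{F} $$

be the so-called tail $\sigma$-algebra. Then, for every $E \in \mathscr{T}$, $\mu(E) \in \{0, 1\}$.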

Thus, for example, it is impossible to have a sequence of independent
real-valued random variables $(X_n)_{n \in \mathbb{N}}$ such that $\lim_{n \to \infty} X_n$ exists with
probability $\tfrac{1}{2}$; either the sequence converges with probability one, or else with
probability one it has no limit at all. There are many other zero-one laws in
probability and statistics: one that will come up later in the study of Monte
Carlo averages is Kesten’s theorem (Theorem 9.17).


2.7 Gaussian Measures 

An important class of probability measures and random variables is the class 
of Gaussians, also known as normal distributions. For many practical prob- 
lems, especially those that are linear or nearly so, Gaussian measures can 
serve as appropriate descriptions of uncertainty; even in the nonlinear sit- 
uation, the Gaussian picture can be an appropriate approximation, though 
not always. In either case, a significant attraction of Gaussian measures is 
that many operations on them (e.g. conditioning) can be performed using 
elementary linear algebra. 

On a theoretical level, Gaussian measures are particularly important
because, unlike Lebesgue measure, they are well defined on infinite-dimensional
spaces, such as function spaces. In $\mathbb{R}^d$, Lebesgue measure is characterized up
to normalization as the unique Borel measure that is simultaneously
• locally finite, i.e. every point of $\mathbb{R}^d$ has an open neighbourhood of finite
Lebesgue measure;

• strictly positive, i.e. every open subset of $\mathbb{R}^d$ has strictly positive Lebesgue
measure; and

• translation invariant, i.e. $\lambda(x + E) = \lambda(E)$ for all $x \in \mathbb{R}^d$ and measurable
$E \subseteq \mathbb{R}^d$.

In addition, Lebesgue measure is $\sigma$-finite. However, the following theorem
shows that there can be nothing like an infinite-dimensional Lebesgue
measure:

Theorem 2.38. Let $\mu$ be a Borel measure on an infinite-dimensional Banach
space $\mathcal{V}$, and, for $v \in \mathcal{V}$, let $T_v \colon \mathcal{V} \to \mathcal{V}$ be the translation map $T_v(x) := v + x$.

(a) If $\mu$ is locally finite and invariant under all translations, then $\mu$ is the
trivial (zero) measure.

(b) If $\mu$ is $\sigma$-finite and quasi-invariant under all translations (i.e. $(T_v)_* \mu$ is
equivalent to $\mu$), then $\mu$ is the trivial (zero) measure.

Gaussian measures on $\mathbb{R}^d$ are defined using a Radon-Nikodym derivative
with respect to Lebesgue measure. To save space, when $P$ is a self-adjoint
and positive-definite matrix or operator on a Hilbert space (see Section 3.3),
write

$$ \langle x, y \rangle_P := \langle x, P y \rangle \equiv \langle P^{1/2} x, P^{1/2} y \rangle , $$
$$ \|x\|_P := \sqrt{\langle x, x \rangle_P} \equiv \| P^{1/2} x \| $$

for the new inner product and norm induced by $P$.

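Definition 2.39. Let $m \in \mathbb{R}^d$ and let $C \in \mathbb{R}^{d \times d}$ be symmetric and positive
definite. The Gaussian measure with mean $m$ and covariance $C$ is denoted
$\mathcal{N}(m, C)$ and defined by

$$ \mathcal{N}(m, C)(E) := \frac{1}{\sqrt{\det C} \, (2\pi)^{d/2}} \int_E \exp\left( - \frac{(x - m) \cdot C^{-1} (x - m)}{2} \right) \mathrm{d}x $$
$$ \equiv \frac{1}{\sqrt{\det C} \, (2\pi)^{d/2}} \int_E \exp\left( - \frac{1}{2} \|x - m\|_{C^{-1}}^2 \right) \mathrm{d}x $$

for each measurable set $E \subseteq \mathbb{R}^d$. The Gaussian measure $\gamma := \mathcal{N}(0, I)$ is called
the standard Gaussian measure. A Dirac measure $\delta_m$ can be considered as a
degenerate Gaussian measure on $\mathbb{R}$, one with variance equal to zero.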
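The density in this definition can be evaluated stably without forming $C^{-1}$ explicitly. The sketch below is one common approach, assuming a Cholesky factorization $C = L L^{\top}$ and working with logarithms; the function name `gaussian_density` and the numerical example are hypothetical. It is compared against a direct transcription of the formula.

```python
import numpy as np

def gaussian_density(x, m, C):
    """Lebesgue density of N(m, C) at x, for symmetric positive-definite C."""
    x, m, C = np.atleast_1d(x), np.atleast_1d(m), np.atleast_2d(C)
    d = m.size
    L = np.linalg.cholesky(C)                 # C = L L^T
    z = np.linalg.solve(L, x - m)             # z @ z = ||x - m||^2 in the C^{-1} norm
    log_det_C = 2.0 * np.sum(np.log(np.diag(L)))
    log_density = -0.5 * (z @ z) - 0.5 * log_det_C - 0.5 * d * np.log(2.0 * np.pi)
    return float(np.exp(log_density))

# Hypothetical two-dimensional example.
m = np.array([1.0, -1.0])
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])
x = np.array([0.5, 0.0])

# Direct transcription of the defining formula, for comparison.
direct = np.exp(-0.5 * (x - m) @ np.linalg.inv(C) @ (x - m)) / (
    np.sqrt(np.linalg.det(C)) * (2.0 * np.pi) ** (m.size / 2)
)
```

The Cholesky factorization fails for a singular covariance, which matches the requirement that $C$ be positive definite: a degenerate Gaussian has no density with respect to Lebesgue measure.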

A non-degenerate Gaussian measure is a strictly positive probability
measure on $\mathbb{R}^d$, i.e. it assigns strictly positive mass to every open subset of $\mathbb{R}^d$;
however, unlike Lebesgue measure, it is not translation invariant:

Lemma 2.40 (Cameron-Martin formula). Let $\mu = \mathcal{N}(m, C)$ be a Gaussian
measure on $\mathbb{R}^d$. Then the push-forward $(T_v)_* \mu$ of $\mu$ by translation by any
$v \in \mathbb{R}^d$, i.e. $\mathcal{N}(m + v, C)$, is equivalent to $\mathcal{N}(m, C)$ and

$$ \frac{\mathrm{d}(T_v)_* \mu}{\mathrm{d}\mu}(x) = \exp\left( \langle v, x - m \rangle_{C^{-1}} - \frac{1}{2} \|v\|_{C^{-1}}^2 \right) , $$

i.e., for every integrable function $f$,

$$ \int_{\mathbb{R}^d} f(x + v) \, \mathrm{d}\mu(x) = \int_{\mathbb{R}^d} f(x) \exp\left( \langle v, x - m \rangle_{C^{-1}} - \frac{1}{2} \|v\|_{C^{-1}}^2 \right) \mathrm{d}\mu(x) . $$
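In finite dimensions, the Cameron-Martin formula says that the ratio of the Lebesgue densities of $\mathcal{N}(m + v, C)$ and $\mathcal{N}(m, C)$ is an explicit exponential factor. The following sketch checks this pointwise at a few arbitrary points; the mean, covariance, shift, and function names are hypothetical choices for illustration.

```python
import numpy as np

def log_gaussian_density(x, m, C):
    """Log of the Lebesgue density of N(m, C) at x."""
    d = m.size
    r = x - m
    _, log_det = np.linalg.slogdet(C)
    return -0.5 * r @ np.linalg.solve(C, r) - 0.5 * log_det - 0.5 * d * np.log(2 * np.pi)

def cameron_martin_factor(x, v, m, C):
    """exp(<v, x - m>_{C^-1} - ||v||^2_{C^-1} / 2)."""
    Cinv_v = np.linalg.solve(C, v)
    return np.exp(Cinv_v @ (x - m) - 0.5 * Cinv_v @ v)

# Hypothetical mean, covariance, shift, and evaluation points.
rng = np.random.default_rng(0)
m = np.array([0.5, -1.0])
C = np.array([[1.5, 0.4],
              [0.4, 0.8]])
v = np.array([0.3, -0.7])
xs = rng.normal(size=(5, 2))

# Ratio of the densities of N(m + v, C) and N(m, C) versus the closed form.
ratios = np.array([
    np.exp(log_gaussian_density(x, m + v, C) - log_gaussian_density(x, m, C))
    for x in xs
])
factors = np.array([cameron_martin_factor(x, v, m, C) for x in xs])
```

The factor depends on $v$ only through $C^{-1} v$, which hints at why, in infinite dimensions, only shifts $v$ for which the analogue of $\|v\|_{C^{-1}}$ is finite can yield an equivalent measure.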

It is easily verified that the push-forward of $\mathcal{N}(m, C)$ by any linear
functional $\ell \colon \mathbb{R}^d \to \mathbb{R}$ is a Gaussian measure on $\mathbb{R}$, and this is taken as the defining
property of a general Gaussian measure for settings in which, by Theorem
2.38, there may not be a Lebesgue measure with respect to which densities
can be taken:

Definition 2.41. A Borel measure $\mu$ on a normed vector space $\mathcal{V}$ is said
to be a (non-degenerate) Gaussian measure if, for every continuous linear
functional $\ell \colon \mathcal{V} \to \mathbb{R}$, the push-forward measure $\ell_* \mu$ is a (non-degenerate)
Gaussian measure on $\mathbb{R}$. Equivalently, $\mu$ is Gaussian if, for every linear map
$T \colon \mathcal{V} \to \mathbb{R}^d$, $T_* \mu = \mathcal{N}(m_T, C_T)$ for some $m_T \in \mathbb{R}^d$ and some symmetric
positive-definite $C_T \in \mathbb{R}^{d \times d}$.

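Definition 2.42. Let $\mu$ be a probability measure on a Banach space $\mathcal{V}$. An
element $m_\mu \in \mathcal{V}$ is called the mean of $\mu$ if

$$ \int_{\mathcal{V}} \langle \ell \mid x - m_\mu \rangle \, \mathrm{d}\mu(x) = 0 \quad \text{for all } \ell \in \mathcal{V}' , $$

so that $\int_{\mathcal{V}} x \, \mathrm{d}\mu(x) = m_\mu$ in the sense of a Pettis integral. If $m_\mu = 0$, then $\mu$ is
said to be centred. The covariance operator is the self-adjoint (i.e.
conjugate-symmetric) operator $C_\mu \colon \mathcal{V}' \times \mathcal{V}' \to \mathbb{K}$ defined by

$$ C_\mu(k, \ell) = \int_{\mathcal{V}} \langle k \mid x - m_\mu \rangle \overline{\langle \ell \mid x - m_\mu \rangle} \, \mathrm{d}\mu(x) \quad \text{for all } k, \ell \in \mathcal{V}' . $$

We often abuse notation and write $C_\mu \colon \mathcal{V}' \to \mathcal{V}''$ for the operator defined by

$$ \langle C_\mu k \mid \ell \rangle := C_\mu(k, \ell) . $$

In the case that $\mathcal{V} = \mathcal{H}$ is a Hilbert space, it is usual to employ the Riesz
representation theorem to identify $\mathcal{H}$ with $\mathcal{H}'$ and $\mathcal{H}''$ and hence treat $C_\mu$ as
a linear operator from $\mathcal{H}$ into itself. The inverse of $C_\mu$, if it exists, is called
the precision operator of $\mu$.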

The covariance operator of a Gaussian measure is closely connected to its
non-degeneracy:

Theorem 2.43 (Vakhania, 1975). Let $\mu$ be a Gaussian measure on a
separable, reflexive Banach space $\mathcal{V}$ with mean $m_\mu \in \mathcal{V}$ and covariance
operator $C_\mu \colon \mathcal{V}' \to \mathcal{V}$. Then the support of $\mu$ is the affine subspace of $\mathcal{V}$ that
is the translation by the mean of the closure of the range of the covariance
operator, i.e.

$$ \operatorname{supp}(\mu) = m_\mu + \overline{C_\mu \mathcal{V}'} . $$

Corollary 2.44. For a Gaussian measure $\mu$ on a separable, reflexive Banach
space $\mathcal{V}$, the following are equivalent:

(a) $\mu$ is non-degenerate;

(b) $C_\mu \colon \mathcal{V}' \to \mathcal{V}$ is one-to-one;

(c) $\overline{C_\mu \mathcal{V}'} = \mathcal{V}$.

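Example 2.45. Consider a Gaussian random variable $X = (X_1, X_2) \sim \mu$
taking values in $\mathbb{R}^2$. Suppose that the mean and covariance of $X$ (or,
equivalently, $\mu$) are, in the usual basis of $\mathbb{R}^2$,

$$ m = \begin{bmatrix} 0 \\ 1 \end{bmatrix} , \qquad C = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} . $$

Then $X = (Z, 1)$, where $Z \sim \mathcal{N}(0, 1)$ is a standard Gaussian random variable
on $\mathbb{R}$; the values of $X$ all lie on the affine line $L := \{ (x_1, x_2) \in \mathbb{R}^2 \mid x_2 = 1 \}$.
Indeed, Vakhania’s theorem says that

$$ \operatorname{supp}(\mu) = m + \overline{C(\mathbb{R}^2)} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} + \left\{ \begin{bmatrix} x_1 \\ 0 \end{bmatrix} \,\middle|\, x_1 \in \mathbb{R} \right\} = L . $$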
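The degenerate Gaussian of this example can be sampled as $X = m + C^{1/2} Z$ with $Z$ standard Gaussian on $\mathbb{R}^2$. The following sketch assumes this standard construction, with the square root computed from an eigendecomposition; the seed and sample size are arbitrary choices.

```python
import numpy as np

m = np.array([0.0, 1.0])
C = np.array([[1.0, 0.0],
              [0.0, 0.0]])          # singular: no variance in the second coordinate

# A symmetric square root of C via its eigendecomposition (a Cholesky
# factorization would fail here because C is only positive semi-definite).
eigenvalues, eigenvectors = np.linalg.eigh(C)
sqrt_C = eigenvectors @ np.diag(np.sqrt(np.clip(eigenvalues, 0.0, None))) @ eigenvectors.T

rng = np.random.default_rng(0)      # fixed seed, chosen arbitrarily
Z = rng.standard_normal(size=(1000, 2))
samples = m + Z @ sqrt_C.T          # X = m + C^{1/2} Z has law N(m, C)
```

Every sample has second coordinate exactly 1, so the samples all lie on the line $L$, while the first coordinate is spread out like a standard Gaussian on $\mathbb{R}$.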


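Gaussian measures can also be identified by reference to their Fourier
transforms:

Theorem 2.46. A probability measure $\mu$ on $\mathcal{V}$ is a Gaussian measure if and
only if its Fourier transform $\hat{\mu} \colon \mathcal{V}' \to \mathbb{C}$ satisfies

$$ \hat{\mu}(\ell) := \int_{\mathcal{V}} e^{i \langle \ell \mid x \rangle} \, \mathrm{d}\mu(x) = \exp\left( i \langle \ell \mid m \rangle - \frac{Q(\ell)}{2} \right) \quad \text{for all } \ell \in \mathcal{V}' , $$

for some $m \in \mathcal{V}$ and some positive-definite quadratic form $Q$ on $\mathcal{V}'$. Indeed, $m$
is the mean of $\mu$ and $Q(\ell) = C_\mu(\ell, \ell)$. Furthermore, if two Gaussian measures
$\mu$ and $\nu$ have the same mean and covariance operator, then $\mu = \nu$.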

Not only does a Gaussian measure have a well-defined mean and variance,
it in fact has moments of all orders:

Theorem 2.47 (Fernique, 1970). Let $\mu$ be a centred Gaussian measure on
a separable Banach space $\mathcal{V}$. Then there exists $\alpha > 0$ such that

$$ \int_{\mathcal{V}} \exp(\alpha \|x\|^2) \, \mathrm{d}\mu(x) < +\infty . $$

A fortiori, $\mu$ has moments of all orders: for all $k \geq 0$,

$$ \int_{\mathcal{V}} \|x\|^k \, \mathrm{d}\mu(x) < +\infty . $$


The covariance operator of a Gaussian measure on a Hilbert space $\mathcal{H}$ is
a self-adjoint operator from $\mathcal{H}$ into itself. A classification of exactly which
self-adjoint operators on $\mathcal{H}$ can be Gaussian covariance operators is provided
by the next result, Sazonov's theorem:

Definition 2.48. Let $K \colon \mathcal{H} \to \mathcal{H}$ be a linear operator on a separable Hilbert
space $\mathcal{H}$.

(a) $K$ is said to be compact if it has a singular value decomposition, i.e. if
there exist finite or countably infinite orthonormal sequences $(u_n)$ and
$(v_n)$ in $\mathcal{H}$ and a sequence of non-negative reals $(\sigma_n)$ such that

\[ K = \sum_n \sigma_n \langle v_n, \cdot \rangle u_n, \]

with $\lim_{n \to \infty} \sigma_n = 0$ if the sequences are infinite.

(b) $K$ is said to be trace class or nuclear if $\sum_n \sigma_n$ is finite, and
Hilbert-Schmidt or nuclear of order 2 if $\sum_n \sigma_n^2$ is finite.

(c) If $K$ is trace class, then its trace is defined to be

\[ \operatorname{tr}(K) := \sum_n \langle e_n, K e_n \rangle \]

for any orthonormal basis $(e_n)$ of $\mathcal{H}$, and (by Lidskii's theorem) this
equals the sum of the eigenvalues of $K$, counted with multiplicity.

Theorem 2.49 (Sazonov, 1958). Let $\mu$ be a centred Gaussian measure on a
separable Hilbert space $\mathcal{H}$. Then $C_\mu \colon \mathcal{H} \to \mathcal{H}$ is trace class and

\[ \operatorname{tr}(C_\mu) = \int_{\mathcal{H}} \|x\|^2 \, d\mu(x). \]

Conversely, if $K \colon \mathcal{H} \to \mathcal{H}$ is positive, self-adjoint and of trace class, then
there is a Gaussian measure $\mu$ on $\mathcal{H}$ such that $C_\mu = K$.

Sazonov's theorem is often stated in terms of the square root $C_\mu^{1/2}$ of $C_\mu$:
$C_\mu^{1/2}$ is Hilbert-Schmidt, i.e. has square-summable singular values $(\sigma_n)_{n \in \mathbb{N}}$.

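The trace identity in Sazonov's theorem can be checked by Monte Carlo in a finite-dimensional truncation. The sketch below assumes NumPy; the eigenvalues $1/k^2$, the truncation level and the sample size are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)               # arbitrary seed

# Hypothetical covariance with summable eigenvalues 1/k^2, truncated at n terms.
n = 20
eigenvalues = 1.0 / np.arange(1, n + 1) ** 2
C = np.diag(eigenvalues)

# Centred Gaussian samples with covariance C (coordinates are independent).
samples = rng.standard_normal((100_000, n)) * np.sqrt(eigenvalues)

trace_C = np.trace(C)                                   # tr(C)
mean_sq_norm = np.mean(np.sum(samples**2, axis=1))      # estimate of E ||x||^2

# The singular values of C^{1/2} are sqrt(eigenvalues); the sum of their
# squares (the squared Hilbert-Schmidt norm of C^{1/2}) is again tr(C).
hs_norm_sq = np.sum(np.sqrt(eigenvalues) ** 2)
print(trace_C, mean_sq_norm, hs_norm_sq)
```

The two exact quantities agree to rounding error, while the sample average agrees with them only up to Monte Carlo error.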
As noted above, even finite-dimensional Gaussian measures are not invariant
under translations, and the change-of-measure formula is given by Lemma
2.40. In the infinite-dimensional setting, it is not even true that translation
produces a new measure that has a density with respect to the old one. This
phenomenon leads to an important object associated with any Gaussian
measure, its Cameron-Martin space:

Definition 2.50. Let $\mu = \mathcal{N}(m, C)$ be a Gaussian measure on a Banach
space $V$. The Cameron-Martin space is the Hilbert space $\mathcal{H}_\mu$ defined
equivalently by:

• $\mathcal{H}_\mu$ is the completion of

\[ \{ h \in V \mid \text{for some } h^* \in V', \ C(h^*, \cdot) = \langle \, \cdot \mid h \rangle \} \]

with respect to the inner product $\langle h, k \rangle_\mu := C(h^*, k^*)$.

• $\mathcal{H}_\mu$ is the completion of the range of the covariance operator $C \colon V' \to V$
with respect to this inner product (cf. the closure with respect to the
norm in $V$ in Theorem 2.43).

• If $V$ is Hilbert, then $\mathcal{H}_\mu$ is the completion of $\operatorname{ran} C^{1/2}$ with the inner
product $\langle h, k \rangle_{C^{-1}} := \langle C^{-1/2} h, C^{-1/2} k \rangle_V$.

• $\mathcal{H}_\mu$ is the set of all $v \in V$ such that $(T_v)_* \mu \sim \mu$, with

\[ \frac{d (T_v)_* \mu}{d \mu}(x) = \exp\left( \langle v, x - m \rangle_{C^{-1}} - \frac{\|v\|_{C^{-1}}^2}{2} \right) \]

as in Lemma 2.40.

• $\mathcal{H}_\mu$ is the intersection of all linear subspaces of $V$ that have full $\mu$-measure.

By Theorem 2.38, if $\mu$ is any probability measure (Gaussian or otherwise)
on an infinite-dimensional space $V$, then we certainly cannot have $\mathcal{H}_\mu = V$.
In fact, one should think of $\mathcal{H}_\mu$ as being a very small subspace of $V$: if $\mathcal{H}_\mu$
is infinite dimensional, then $\mu(\mathcal{H}_\mu) = 0$. Also, infinite-dimensional spaces
have the extreme property that Gaussian measures on such spaces are either
equivalent or mutually singular: there is no middle ground in the way that
Lebesgue measure on $[0, 1]$ has a density with respect to Lebesgue measure
on $\mathbb{R}$ but is not equivalent to it.

Theorem 2.51 (Feldman-Hájek). Let $\mu$, $\nu$ be Gaussian probability measures
on a normed vector space $V$. Then either

• $\mu$ and $\nu$ are equivalent, i.e. $\mu(E) = 0 \iff \nu(E) = 0$, and hence each
has a strictly positive density with respect to the other; or

• $\mu$ and $\nu$ are mutually singular, i.e. there exists $E$ such that $\mu(E) = 0$ and
$\nu(E) = 1$, and so neither $\mu$ nor $\nu$ can have a density with respect to the
other.

Furthermore, equivalence holds if and only if

(a) $\operatorname{ran} C_\mu^{1/2} = \operatorname{ran} C_\nu^{1/2}$;

(b) $m_\mu - m_\nu \in \operatorname{ran} C_\mu^{1/2} = \operatorname{ran} C_\nu^{1/2}$; and

(c) $T := (C_\mu^{-1/2} C_\nu^{1/2}) (C_\mu^{-1/2} C_\nu^{1/2})^* - I$ is Hilbert-Schmidt in $\overline{\operatorname{ran} C_\mu^{1/2}}$.

The Cameron-Martin and Feldman-Hájek theorems show that translation
by any vector not in the Cameron-Martin space $\mathcal{H}_\mu \subseteq V$ produces a new
measure that is mutually singular with respect to the old one. It turns out
that dilation by a non-unitary constant also destroys equivalence:

Proposition 2.52. Let $\mu$ be a centred Gaussian measure on a separable real
Banach space $V$ such that $\dim \mathcal{H}_\mu = \infty$. For $c \in \mathbb{R}$, let $D_c \colon V \to V$ be the
dilation map $D_c(x) := cx$. Then $(D_c)_* \mu$ is equivalent to $\mu$ if and only if
$c \in \{ \pm 1 \}$, and $(D_c)_* \mu$ and $\mu$ are mutually singular otherwise.

Remark 2.53. There is another attractive viewpoint on Gaussian measures
on Hilbert spaces, namely that draws from a Gaussian measure $\mathcal{N}(m, C)$ on
a Hilbert space are the same as draws from random series of the form

\[ m + \sum_{k \in \mathbb{N}} \lambda_k^{1/2} \xi_k \psi_k, \]

where $\{ \psi_k \}_{k \in \mathbb{N}}$ are orthonormal eigenvectors for the covariance operator $C$,
$\{ \lambda_k \}_{k \in \mathbb{N}}$ are the corresponding eigenvalues, and $\{ \xi_k \}_{k \in \mathbb{N}}$ are independent
draws from the standard normal distribution $\mathcal{N}(0, 1)$ on $\mathbb{R}$. This point of view
will be revisited in more detail in Section 11.1 in the context of
Karhunen-Loève expansions of Gaussian and Besov measures.
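In finite dimensions the random series of Remark 2.53 is a finite sum, which gives a simple sampling recipe. The following is a minimal sketch, assuming NumPy; the mean `m`, the covariance `C` and the sample size are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)                 # arbitrary seed

m = np.array([1.0, -1.0, 0.5])                 # hypothetical mean
G = rng.standard_normal((3, 3))
C = G @ G.T + 0.1 * np.eye(3)                  # hypothetical positive-definite covariance

# Orthonormal eigenvectors psi[:, k] with eigenvalues lam[k] of C.
lam, psi = np.linalg.eigh(C)

# X = m + sum_k sqrt(lam_k) * xi_k * psi_k, with xi_k i.i.d. N(0, 1).
xi = rng.standard_normal((200_000, 3))
X = m + (xi * np.sqrt(lam)) @ psi.T

emp_mean = X.mean(axis=0)                      # should be close to m
emp_cov = np.cov(X, rowvar=False)              # should be close to C
print(np.abs(emp_mean - m).max(), np.abs(emp_cov - C).max())
```

The empirical mean and covariance match `m` and `C` up to sampling error, which is the finite-dimensional content of the remark.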

The conditioning properties of Gaussian measures can easily be expressed
using an elementary construction from linear algebra, the Schur complement.
This result will be very useful in Chapters 6, 7, and 13.

Theorem 2.54 (Conditioning of Gaussian measures). Let $\mathcal{H} = \mathcal{H}_1 \oplus \mathcal{H}_2$ be a
direct sum of separable Hilbert spaces. Let $X = (X_1, X_2) \sim \mu$ be an $\mathcal{H}$-valued
Gaussian random variable with mean $m = (m_1, m_2)$ and positive-definite
covariance operator $C$. For $i, j = 1, 2$, let

\[ C_{ij}(k_i, k_j) := \mathbb{E}_\mu \left[ \langle k_i, x - m_i \rangle \overline{\langle k_j, x - m_j \rangle} \right] \tag{2.6} \]

for all $k_i \in \mathcal{H}_i$, $k_j \in \mathcal{H}_j$, so that $C$ is decomposed² in block form as

\[ C = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}; \tag{2.7} \]

in particular, the marginal distribution of $X_1$ is $\mathcal{N}(m_1, C_{11})$, and $C_{21} = C_{12}^*$.
Then $C_{22}$ is invertible and, for each $x_2 \in \mathcal{H}_2$, the conditional distribution of
$X_1$ given $X_2 = x_2$ is Gaussian:

\[ (X_1 \mid X_2 = x_2) \sim \mathcal{N}\left( m_1 + C_{12} C_{22}^{-1} (x_2 - m_2), \ C_{11} - C_{12} C_{22}^{-1} C_{21} \right). \tag{2.8} \]
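Formula (2.8) translates directly into a few lines of linear algebra when the spaces are finite-dimensional and real. The sketch below assumes NumPy; the function name `condition_gaussian` and the test covariance are hypothetical and serve only as an illustration.

```python
import numpy as np

def condition_gaussian(m1, m2, C11, C12, C22, x2):
    """Mean and covariance of X1 given X2 = x2, following (2.8)."""
    # gain = C12 C22^{-1}, obtained by a linear solve instead of an explicit
    # inverse (this uses the symmetry of C22).
    gain = np.linalg.solve(C22, C12.T).T
    cond_mean = m1 + gain @ (x2 - m2)
    cond_cov = C11 - gain @ C12.T          # Schur complement of C22 in C
    return cond_mean, cond_cov

rng = np.random.default_rng(8)             # arbitrary seed
G = rng.standard_normal((5, 5))
C = G @ G.T + np.eye(5)                    # hypothetical positive-definite covariance
m = rng.standard_normal(5)

# Split R^5 = R^2 (+) R^3.
C11, C12, C22 = C[:2, :2], C[:2, 2:], C[2:, 2:]
m1, m2 = m[:2], m[2:]
x2 = np.array([0.3, -1.2, 0.7])            # hypothetical observed value of X2

cond_mean, cond_cov = condition_gaussian(m1, m2, C11, C12, C22, x2)

# Cross-check against the precision matrix P = C^{-1}: the conditional
# covariance is the inverse of the (1,1) block of P.
P = np.linalg.inv(C)
print(np.allclose(cond_cov, np.linalg.inv(P[:2, :2])))
```

Working with a linear solve rather than `np.linalg.inv(C22)` is a common numerical choice; either route gives the same result in exact arithmetic.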


2.8 Interpretations of Probability 

It is worth noting that the above discussions are purely mathematical: a
probability measure is an abstract algebraic-analytic object with no necessary
connection to everyday notions of chance or probability. The question
of what interpretation of probability to adopt, i.e. what practical meaning
to ascribe to probability measures, is a question of philosophy and
mathematical modelling. The two main points of view are the frequentist and
Bayesian perspectives. To a frequentist, the probability $\mu(E)$ of an event $E$
is the relative frequency of occurrence of the event $E$ in the limit of infinitely
many independent but identical trials; to a Bayesian, $\mu(E)$ is a numerical
representation of one's degree of belief in the truth of a proposition $E$. The
frequentist's point of view is objective; the Bayesian's is subjective; both use
the same mathematical machinery of probability measures to describe the
properties of the function $\mu$.

² Here we are again abusing notation to conflate $C_{ij} \colon \mathcal{H}_i \oplus \mathcal{H}_j \to \mathbb{K}$ defined in (2.6) with
$C_{ij} \colon \mathcal{H}_j \to \mathcal{H}_i$ given by $\langle C_{ij}(k_j), k_i \rangle_{\mathcal{H}_i} = C_{ij}(k_i, k_j)$.

Frequentists are careful to distinguish between parts of their analyses that
are fixed and deterministic versus those that have a probabilistic character.
However, for a Bayesian, any uncertainty can be described in terms of a
suitable probability measure. In particular, one's beliefs about some unknown
$\theta$ (taking values in a space $\Theta$) in advance of observing data are summarized
by a prior probability measure $\pi$ on $\Theta$. The other ingredient of a Bayesian
analysis is a likelihood function, which is up to normalization a conditional
probability: given any observed datum $y$, $L(y \mid \theta)$ is the likelihood of observing
$y$ if the parameter value $\theta$ were the truth. A Bayesian's belief about $\theta$ given
the prior $\pi$ and the observed datum $y$ is the posterior probability measure
$\pi(\,\cdot \mid y)$ on $\Theta$, which is just the conditional probability

\[ \pi(\theta \mid y) = \frac{L(y \mid \theta) \pi(\theta)}{\mathbb{E}_\pi[L(y \mid \theta)]} = \frac{L(y \mid \theta) \pi(\theta)}{\int_\Theta L(y \mid \theta) \, d\pi(\theta)} \]

or, written in a way that generalizes better to infinite-dimensional $\Theta$, we have
a density/Radon-Nikodym derivative

\[ \frac{d\pi(\,\cdot \mid y)}{d\pi}(\theta) \propto L(y \mid \theta). \]

Both the previous two equations are referred to as Bayes' rule, and are at
this stage informal applications of the standard Bayes' rule (Theorem 2.10)
for events $A$ and $B$ of non-zero probability.

Example 2.55. Parameter estimation provides a good example of the
philosophical difference between frequentist and subjectivist uses of probability.
Suppose that $X_1, \dots, X_n$ are $n$ independent and identically distributed
observations of some random variable $X$, which is distributed according to the
normal distribution $\mathcal{N}(\theta, 1)$ of mean $\theta$ and variance 1. We set our frequentist
and Bayesian statisticians the challenge of estimating $\theta$ from the data
$d := (X_1, \dots, X_n)$.

(a) To the frequentist, $\theta$ is a well-defined real number that happens to be
unknown. This number can be estimated using the estimator

\[ \hat{\theta}_n := \frac{1}{n} \sum_{i=1}^n X_i, \]

which is a random variable. It makes sense to say that $\hat{\theta}_n$ is close to $\theta$
with high probability, and hence to give a confidence interval for $\theta$, but
$\theta$ itself does not have a distribution.

(b) To the Bayesian, $\theta$ is a random variable, and its distribution in advance
of seeing the data is encoded in a prior $\pi$. Upon seeing the data and
conditioning upon it using Bayes' rule, the distribution of the parameter
is the posterior distribution $\pi(\theta \mid d)$. The posterior encodes everything that
is known about $\theta$ in view of $\pi$, $L(y \mid \theta) \propto e^{-|y - \theta|^2 / 2}$ and $d$, although this
information may be summarized by a single number such as the maximum
a posteriori estimator

\[ \hat{\theta}_{\mathrm{MAP}} := \operatorname*{arg\,max}_{\theta \in \mathbb{R}} \pi(\theta \mid d) \]

or the maximum likelihood estimator

\[ \hat{\theta}_{\mathrm{MLE}} := \operatorname*{arg\,max}_{\theta \in \mathbb{R}} L(d \mid \theta). \]
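The two point estimators of Example 2.55 can be compared on simulated data. The sketch below assumes NumPy and, purely for illustration, a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2)$; the prior, the value of $\tau^2$, the "true" parameter and the grid are all assumptions that are not part of the example. Under this assumed prior the posterior is Gaussian and its maximiser is $\sum_i X_i / (n + \tau^{-2})$.

```python
import numpy as np

rng = np.random.default_rng(3)                   # arbitrary seed
theta_true, n = 1.5, 20                          # hypothetical truth and sample size
d = rng.normal(theta_true, 1.0, size=n)          # data X_1, ..., X_n ~ N(theta, 1)

tau2 = 4.0                                       # assumed prior variance: theta ~ N(0, tau2)
theta = np.linspace(-5.0, 5.0, 100_001)          # grid over the parameter space

# Log-likelihood of the whole data set, up to an additive constant:
# -1/2 * sum_i (X_i - theta)^2, expanded so that it vectorises over the grid.
log_lik = -0.5 * (np.sum(d**2) - 2.0 * theta * d.sum() + n * theta**2)
log_prior = -0.5 * theta**2 / tau2
log_post = log_lik + log_prior                   # Bayes' rule, up to normalisation

theta_mle = theta[np.argmax(log_lik)]            # grid maximiser of the likelihood
theta_map = theta[np.argmax(log_post)]           # grid maximiser of the posterior

# Closed forms for comparison (valid under the assumed Gaussian prior).
mle_exact = d.mean()
map_exact = d.sum() / (n + 1.0 / tau2)
print(theta_mle, mle_exact, theta_map, map_exact)
```

The maximum likelihood estimator coincides with the frequentist sample mean, while the maximum a posteriori estimator is pulled towards the prior mean by an amount that vanishes as $n$ grows.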

The Bayesian perspective can be seen as the natural extension of classical 
Aristotelian bivalent (i.e. true-or-false) logic to propositions of uncertain 
truth value. This point of view is underwritten by Cox’s theorem (Cox, 
1946, 1961), which asserts that any ‘natural’ extension of Aristotelian logic to 
$\mathbb{R}$-valued truth values is probabilistic, and specifically Bayesian, although the
‘naturality’ of the hypotheses has been challenged by, e.g., Halpern (1999a, b). 

It is also worth noting that there is a significant community that, in
addition to being frequentist or Bayesian, asserts that selecting a single
probability measure is too precise a description of uncertainty. These
'imprecise probabilists' count such distinguished figures as George Boole and
John Maynard Keynes among their ranks, and would prefer to say that
$\tfrac{1}{2} - 2^{-100} \leq \mathbb{P}[\text{heads}] \leq \tfrac{1}{2} + 2^{-100}$ than commit themselves to the assertion
that $\mathbb{P}[\text{heads}] = \tfrac{1}{2}$; imprecise probabilists would argue that the former
assertion can be verified, to a prescribed level of confidence, in finite time, whereas
the latter cannot. Techniques like the use of lower and upper probabilities (or
interval probabilities) are popular in this community, including sophisticated
generalizations like Dempster-Shafer theory; one can also consider feasible
sets of probability measures, which is the approach taken in Chapter 14.

Exercise 2.1. Let $X$ be any $\mathbb{C}^n$-valued random variable with mean $m \in \mathbb{C}^n$
and covariance matrix

\[ C := \mathbb{E}\left[ (X - m)(X - m)^* \right] \in \mathbb{C}^{n \times n}. \]

(a) Show that $C$ is conjugate-symmetric and positive semi-definite. For what
collection of vectors in $\mathbb{C}^n$ is $C$ the Gram matrix?

(b) Show that if the support of $X$ is all of $\mathbb{C}^n$, then $C$ is positive definite.
Hint: suppose that $C$ has non-trivial kernel, construct an open half-space
$H$ of $\mathbb{C}^n$ such that $X \notin H$ almost surely.

Exercise 2.2. Let $X$ be any random variable taking values in a Hilbert space
$\mathcal{H}$, with mean $m \in \mathcal{H}$ and covariance operator $C \colon \mathcal{H} \times \mathcal{H} \to \mathbb{C}$ defined by

\[ C(h, k) := \mathbb{E}\left[ \langle h, X - m \rangle \overline{\langle k, X - m \rangle} \right] \]

for $h, k \in \mathcal{H}$. Show that $C$ is conjugate-symmetric and positive semi-definite.
Show also that if there is no subspace $S \subseteq \mathcal{H}$ with $\dim S \geq 1$ such that
$X \perp S$ with probability one, then $C$ is positive definite.

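Exercise 2.3. Prove the finite-dimensional Cameron-Martin formula of
Lemma 2.40. That is, let $\mu = \mathcal{N}(m, C)$ be a Gaussian measure on $\mathbb{R}^d$ and
let $v \in \mathbb{R}^d$, and show that the push-forward of $\mu$ by translation by $v$, namely
$\mathcal{N}(m + v, C)$, is equivalent to $\mu$ and

\[ \frac{d (T_v)_* \mu}{d \mu}(x) = \exp\left( \langle v, x - m \rangle_{C^{-1}} - \frac{1}{2} \|v\|_{C^{-1}}^2 \right), \]

i.e., for every integrable function $f$,

\[ \int_{\mathbb{R}^d} f(x + v) \, d\mu(x) = \int_{\mathbb{R}^d} f(x) \exp\left( \langle v, x - m \rangle_{C^{-1}} - \frac{1}{2} \|v\|_{C^{-1}}^2 \right) d\mu(x). \]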
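Before attempting the proof, the integral identity in Exercise 2.3 can be sanity-checked by Monte Carlo. This is only a numerical illustration, not a solution; the sketch assumes NumPy, and the values of `m`, `C`, `v` and the test function `f` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)                   # arbitrary seed

m = np.array([0.5, -1.0])                        # hypothetical mean
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])                       # hypothetical covariance
v = np.array([0.3, -0.2])                        # hypothetical shift

C_inv_v = np.linalg.solve(C, v)                  # C^{-1} v

def density(x):
    """exp(<v, x - m>_{C^{-1}} - ||v||^2_{C^{-1}} / 2), evaluated at the rows of x."""
    return np.exp((x - m) @ C_inv_v - 0.5 * v @ C_inv_v)

def f(x):
    """An arbitrary integrable test function on R^2."""
    return np.cos(x[:, 0]) + x[:, 1] ** 2

X = rng.multivariate_normal(m, C, size=400_000)  # samples from mu = N(m, C)
lhs = f(X + v).mean()                            # estimates the integral of f(x + v) d mu(x)
rhs = (f(X) * density(X)).mean()                 # estimates the weighted integral
print(lhs, rhs)
```

For this particular `f` the common value is available in closed form, $\cos(0.8) e^{-1} + 2.44 \approx 2.696$, since $X + v \sim \mathcal{N}(m + v, C)$.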


Exercise 2.4. Let $T \colon \mathcal{H} \to \mathcal{K}$ be a bounded linear map between Hilbert
spaces $\mathcal{H}$ and $\mathcal{K}$, with adjoint $T^* \colon \mathcal{K} \to \mathcal{H}$, and let $\mu = \mathcal{N}(m, C)$ be a
Gaussian measure on $\mathcal{H}$. Show that the push-forward measure $T_* \mu$ is a Gaussian
measure on $\mathcal{K}$ and that $T_* \mu = \mathcal{N}(Tm, TCT^*)$.

Exercise 2.5. For $i = 1, 2$, let $X_i \sim \mathcal{N}(m_i, C_i)$ be independent Gaussian
random variables taking values in Hilbert spaces $\mathcal{H}_i$, and let $T_i \colon \mathcal{H}_i \to \mathcal{K}$ be
a bounded linear map taking values in another Hilbert space $\mathcal{K}$, with adjoint
$T_i^* \colon \mathcal{K} \to \mathcal{H}_i$. Show that $T_1 X_1 + T_2 X_2$ is a Gaussian random variable in $\mathcal{K}$
with

\[ T_1 X_1 + T_2 X_2 \sim \mathcal{N}\left( T_1 m_1 + T_2 m_2, \ T_1 C_1 T_1^* + T_2 C_2 T_2^* \right). \]

Give an example to show that the independence assumption is necessary.

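Exercise 2.6. Let $\mathcal{H}$ and $\mathcal{K}$ be Hilbert spaces. Suppose that $A \colon \mathcal{H} \to \mathcal{H}$ and
$C \colon \mathcal{K} \to \mathcal{K}$ are self-adjoint and positive definite, that $B \colon \mathcal{H} \to \mathcal{K}$, and that
$D \colon \mathcal{K} \to \mathcal{K}$ is self-adjoint and positive semi-definite. Show that the operator
from $\mathcal{H} \oplus \mathcal{K}$ to itself given in block form by

\[ \begin{bmatrix} A + B^* C B & -B^* C \\ -C B & C + D \end{bmatrix} \]

is self-adjoint and positive-definite.

Exercise 2.7 (Inversion lemma). Let $\mathcal{H}$ and $\mathcal{K}$ be Hilbert spaces, and let
$A \colon \mathcal{H} \to \mathcal{H}$, $B \colon \mathcal{K} \to \mathcal{H}$, $C \colon \mathcal{H} \to \mathcal{K}$, and $D \colon \mathcal{K} \to \mathcal{K}$ be linear maps. Define
$M \colon \mathcal{H} \oplus \mathcal{K} \to \mathcal{H} \oplus \mathcal{K}$ in block form by

\[ M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}. \]

Show that if $A$, $D$, $A - B D^{-1} C$ and $D - C A^{-1} B$ are all non-singular, then

\[ M^{-1} = \begin{bmatrix} A^{-1} + A^{-1} B (D - C A^{-1} B)^{-1} C A^{-1} & -A^{-1} B (D - C A^{-1} B)^{-1} \\ -(D - C A^{-1} B)^{-1} C A^{-1} & (D - C A^{-1} B)^{-1} \end{bmatrix} \]

and

\[ M^{-1} = \begin{bmatrix} (A - B D^{-1} C)^{-1} & -(A - B D^{-1} C)^{-1} B D^{-1} \\ -D^{-1} C (A - B D^{-1} C)^{-1} & D^{-1} + D^{-1} C (A - B D^{-1} C)^{-1} B D^{-1} \end{bmatrix}. \]

Hence derive the Woodbury formula

\[ (A + B D^{-1} C)^{-1} = A^{-1} - A^{-1} B (D + C A^{-1} B)^{-1} C A^{-1}. \tag{2.9} \]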
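The identities of Exercise 2.7 are easy to test numerically before proving them. The following is a sketch under stated assumptions: it uses NumPy, real matrices in place of operators, and hypothetical random blocks scaled so that all the required inverses exist. The name `Cm` stands for the block $C$ of the exercise.

```python
import numpy as np

rng = np.random.default_rng(5)                   # arbitrary seed
n, k = 6, 2                                      # hypothetical block sizes

A = np.diag(rng.uniform(1.0, 2.0, size=n))       # n x n block, easy to invert
D = np.diag(rng.uniform(1.0, 2.0, size=k))       # k x k block
B = 0.1 * rng.standard_normal((n, k))            # small off-diagonal blocks keep
Cm = 0.1 * rng.standard_normal((k, n))           # every Schur complement invertible

A_inv = np.linalg.inv(A)
D_inv = np.linalg.inv(D)

# Woodbury formula (2.9): an n x n inverse expressed through a k x k inverse.
lhs = np.linalg.inv(A + B @ D_inv @ Cm)
rhs = A_inv - A_inv @ B @ np.linalg.inv(D + Cm @ A_inv @ B) @ Cm @ A_inv

# First block formula for M^{-1}, built from the Schur complement D - C A^{-1} B.
S_inv = np.linalg.inv(D - Cm @ A_inv @ B)
M = np.block([[A, B], [Cm, D]])
M_inv = np.block([
    [A_inv + A_inv @ B @ S_inv @ Cm @ A_inv, -A_inv @ B @ S_inv],
    [-S_inv @ Cm @ A_inv, S_inv],
])
print(np.allclose(lhs, rhs), np.allclose(M_inv @ M, np.eye(n + k)))
```

The practical appeal of (2.9) is visible in the shapes: when $k \ll n$ and $A^{-1}$ is cheap, only a $k \times k$ system needs to be solved.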


Exercise 2.8. Exercise 2.7 has a natural interpretation in terms of the
conditioning of Gaussian random variables. Let $(X, Y) \sim \mathcal{N}(m, C)$ be jointly
Gaussian, where, in block form,

\[ m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}, \qquad C = \begin{bmatrix} C_{11} & C_{12} \\ C_{12}^* & C_{22} \end{bmatrix}, \]

and $C$ is self-adjoint and positive definite.

(a) Show that $C_{11}$ and $C_{22}$ are self-adjoint and positive-definite.

(b) Show that the Schur complement $S$ defined by $S := C_{11} - C_{12} C_{22}^{-1} C_{12}^*$ is
self-adjoint and positive definite, and

\[ C^{-1} = \begin{bmatrix} S^{-1} & -S^{-1} C_{12} C_{22}^{-1} \\ -C_{22}^{-1} C_{12}^* S^{-1} & C_{22}^{-1} + C_{22}^{-1} C_{12}^* S^{-1} C_{12} C_{22}^{-1} \end{bmatrix}. \]

(c) Hence prove Theorem 2.54, that the conditional distribution of $X$ given
that $Y = y$ is Gaussian:

\[ (X \mid Y = y) \sim \mathcal{N}\left( m_1 + C_{12} C_{22}^{-1} (y - m_2), \ S \right). \]


Chapter 3 

Banach and Hilbert Spaces 


Dr. von Neumann, ich möchte gern wissen,
was ist dann eigentlich ein Hilbertscher
Raum?
(Dr. von Neumann, I would very much like to know:
what, after all, is a Hilbert space?)


David Hilbert 


This chapter covers the necessary concepts from linear functional analysis 
on Hilbert and Banach spaces: in particular, we review here basic construc- 
tions such as orthogonality, direct sums and tensor products. Like Chapter 2, 
this chapter is intended as a review of material that should be understood as 
a prerequisite before proceeding; to an extent, Chapters 2 and 3 are interde- 
pendent and so can (and should) be read in parallel with one another. 


3.1 Basic Definitions and Properties 

In what follows, $\mathbb{K}$ will denote either the real numbers $\mathbb{R}$ or the complex
numbers $\mathbb{C}$, and $|\cdot|$ denotes the absolute value function on $\mathbb{K}$. All the vector
spaces considered in this book will be vector spaces over one of these two
fields. In $\mathbb{K}$, notions of 'size' and 'closeness' are provided by the absolute
value function $|\cdot|$. In a normed vector space, similar notions of 'size' and
'closeness' are provided by a function called a norm, from which we can build
up notions of convergence, continuity, limits and so on.

Definition 3.1. A norm on a vector space $V$ over $\mathbb{K}$ is a function $\|\cdot\| \colon V \to \mathbb{R}$
that is

(a) positive semi-definite: for all $x \in V$, $\|x\| \geq 0$;

(b) positive definite: for all $x \in V$, $\|x\| = 0$ if and only if $x = 0$;

(c) positively homogeneous: for all $x \in V$ and $\alpha \in \mathbb{K}$, $\|\alpha x\| = |\alpha| \|x\|$; and

(d) sublinear: for all $x, y \in V$, $\|x + y\| \leq \|x\| + \|y\|$.

If the positive definiteness requirement is omitted, then $\|\cdot\|$ is said to be a
seminorm. A vector space equipped with a norm (resp. seminorm) is called
a normed space (resp. seminormed space).

In a normed vector space, we can sensibly talk about the ‘size’ or ‘length’ 
of a single vector, but there is no sensible notion of ‘angle’ between two 
vectors, and in particular there is no notion of orthogonality. Such notions 
are provided by an inner product: 

Definition 3.2. An inner product on a vector space $V$ over $\mathbb{K}$ is a function
$\langle \cdot, \cdot \rangle \colon V \times V \to \mathbb{K}$ that is

(a) positive semi-definite: for all $x \in V$, $\langle x, x \rangle \geq 0$;

(b) positive definite: for all $x \in V$, $\langle x, x \rangle = 0$ if and only if $x = 0$;

(c) conjugate symmetric: for all $x, y \in V$, $\langle x, y \rangle = \overline{\langle y, x \rangle}$; and

(d) sesquilinear: for all $x, y, z \in V$ and all $\alpha, \beta \in \mathbb{K}$,
$\langle x, \alpha y + \beta z \rangle = \alpha \langle x, y \rangle + \beta \langle x, z \rangle$.

A vector space equipped with an inner product is called an inner product
space. In the case $\mathbb{K} = \mathbb{R}$, conjugate symmetry becomes symmetry, and
sesquilinearity becomes bilinearity.

Many texts have sesquilinear forms be linear in the first argument, rather 
than the second as is done here; this is an entirely cosmetic difference that 
has no serious consequences, provided that one makes a consistent choice and 
sticks with it. 

It is easily verified that every inner product space is a normed space under
the induced norm

\[ \|x\| := \sqrt{\langle x, x \rangle}. \]

The inner product and norm satisfy the Cauchy-Schwarz inequality

\[ |\langle x, y \rangle| \leq \|x\| \|y\| \quad \text{for all } x, y \in V, \tag{3.1} \]

where equality holds in (3.1) if and only if $x$ and $y$ are scalar multiples of one
another. Every norm on $V$ that is induced by an inner product satisfies the
parallelogram identity

\[ \|x + y\|^2 + \|x - y\|^2 = 2 \|x\|^2 + 2 \|y\|^2 \quad \text{for all } x, y \in V. \tag{3.2} \]

In the opposite direction, if $\|\cdot\|$ is a norm on $V$ that satisfies the parallelogram
identity (3.2), then the unique inner product $\langle \cdot, \cdot \rangle$ that induces this norm is
found by the polarization identity

\[ \langle x, y \rangle = \frac{\|x + y\|^2 - \|x - y\|^2}{4} \]

in the real case, and

\[ \langle x, y \rangle = \frac{\|x + y\|^2 - \|x - y\|^2}{4} + i \, \frac{\|ix + y\|^2 - \|ix - y\|^2}{4} \]

in the complex case.
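The parallelogram and polarization identities can be checked on random complex vectors. The sketch below assumes NumPy and the convention of Definition 3.2 (linear in the second argument, conjugate-linear in the first), which is also the convention of `np.vdot`; the helper names `inner`, `norm_sq` and `polarize` are hypothetical.

```python
import numpy as np

def inner(x, y):
    """Inner product on C^n: conjugate-linear in x, linear in y."""
    return np.vdot(x, y)

def norm_sq(x):
    """Squared induced norm <x, x>."""
    return np.vdot(x, x).real

def polarize(x, y):
    """Recover <x, y> from norms only (complex case, linear in the second argument)."""
    real_part = (norm_sq(x + y) - norm_sq(x - y)) / 4
    imag_part = (norm_sq(1j * x + y) - norm_sq(1j * x - y)) / 4
    return real_part + 1j * imag_part

rng = np.random.default_rng(9)                   # arbitrary seed
x = rng.standard_normal(4) + 1j * rng.standard_normal(4)
y = rng.standard_normal(4) + 1j * rng.standard_normal(4)

# Parallelogram identity (3.2): both sides computed from norms.
parallelogram_gap = norm_sq(x + y) + norm_sq(x - y) - 2 * norm_sq(x) - 2 * norm_sq(y)
polarization_gap = abs(polarize(x, y) - inner(x, y))
print(parallelogram_gap, polarization_gap)
```

With the opposite convention (linearity in the first argument), the sign of the imaginary term flips, which is a common source of confusion when comparing texts.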

The simplest examples of normed and inner product spaces are the familiar
finite-dimensional Euclidean spaces:

Example 3.3. Here are some finite-dimensional examples of norms on $\mathbb{K}^n$:

(a) The absolute value function $|\cdot|$ is a norm on $\mathbb{K}$.

(b) The most familiar example of a norm is probably the Euclidean norm or
2-norm on $\mathbb{K}^n$. The Euclidean norm of $v = (v_1, \dots, v_n) \in \mathbb{K}^n$ is given by

\[ \|v\|_2 := \sqrt{\sum_{i=1}^n |v_i|^2} = \sqrt{\sum_{i=1}^n |v \cdot e_i|^2}. \tag{3.5} \]

The Euclidean norm is the induced norm for the inner product

\[ \langle u, v \rangle := \sum_{i=1}^n \overline{u_i} v_i. \tag{3.6} \]

In the case $\mathbb{K} = \mathbb{R}$ this inner product is commonly called the dot product
and denoted $u \cdot v$.

(c) The analogous inner product and norm on $\mathbb{K}^{m \times n}$ of $m \times n$ matrices is
the Frobenius inner product

\[ \langle A, B \rangle \equiv A : B := \sum_{i=1}^m \sum_{j=1}^n \overline{a_{ij}} b_{ij}. \]

(d) The 1-norm, also known as the Manhattan norm or taxicab norm, on $\mathbb{K}^n$
is defined by

\[ \|v\|_1 := \sum_{i=1}^n |v_i|. \tag{3.7} \]

(e) More generally, for $1 \leq p < \infty$, the p-norm on $\mathbb{K}^n$ is defined by

\[ \|v\|_p := \left( \sum_{i=1}^n |v_i|^p \right)^{1/p}. \tag{3.8} \]

(f) Note, however, that the formula in (3.8) does not define a norm on $\mathbb{K}^n$
if $p < 1$.

(g) The analogous norm for $p = \infty$ is the $\infty$-norm or maximum norm on $\mathbb{K}^n$:

\[ \|v\|_\infty := \max_{i = 1, \dots, n} |v_i|. \tag{3.9} \]

There are also many straightforward examples of infinite-dimensional
normed spaces. In UQ applications, these spaces often arise as the solution
spaces for ordinary or partial differential equations, spaces of random
variables, or spaces for sequences of coefficients of expansions of random fields
and stochastic processes.

Example 3.4. (a) An obvious norm to define for a sequence $v = (v_n)_{n \in \mathbb{N}}$
is the analogue of the maximum norm. That is, define the supremum
norm by

\[ \|v\|_\infty := \sup_{n \in \mathbb{N}} |v_n|. \tag{3.10} \]

Clearly, if $v$ is not a bounded sequence, then $\|v\|_\infty = \infty$. Since norms
are not allowed to take the value $\infty$, the supremum norm is only a norm
on the space of bounded sequences; this space is often denoted $\ell^\infty$, or
sometimes $\ell^\infty(\mathbb{K})$ if we wish to emphasize the field of scalars, or $B(\mathbb{N}; \mathbb{K})$
if we want to emphasize that it is a space of bounded functions on some
set, in this case $\mathbb{N}$.

(b) Similarly, for $1 \leq p < \infty$, the p-norm of a sequence is defined by

\[ \|v\|_p := \left( \sum_{n \in \mathbb{N}} |v_n|^p \right)^{1/p}. \tag{3.11} \]

The space of sequences for which this norm is finite is the space of
p-summable sequences, which is often denoted $\ell^p(\mathbb{K})$ or just $\ell^p$. The
statement from elementary analysis courses that $\sum_{n=1}^\infty \frac{1}{n}$ (the harmonic series)
diverges but that $\sum_{n=1}^\infty \frac{1}{n^2}$ converges is the statement that

\[ \left( 1, \tfrac{1}{2}, \tfrac{1}{3}, \dots \right) \in \ell^2 \quad \text{but} \quad \left( 1, \tfrac{1}{2}, \tfrac{1}{3}, \dots \right) \notin \ell^1. \]

(c) If $S$ is any set, and $B(S; \mathbb{K})$ denotes the vector space of all bounded
$\mathbb{K}$-valued functions on $S$, then a norm on $B(S; \mathbb{K})$ is the supremum norm
(or uniform norm) defined by

\[ \|f\|_\infty := \sup_{x \in S} |f(x)|. \]

(d) Since every continuous function on a closed and bounded interval is
bounded, the supremum norm is also a norm on the space $C^0([0, 1]; \mathbb{R})$ of
continuous real-valued functions on the unit interval.
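The finite-dimensional p-norms of Example 3.3 are straightforward to compute, and a two-line experiment shows why (3.8) fails to be a norm for $p < 1$. The sketch below assumes NumPy; the helper name `p_norm` and the test vectors are hypothetical.

```python
import numpy as np

def p_norm(v, p):
    """Evaluate (sum_i |v_i|^p)^(1/p); this is a norm only when p >= 1."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

v = np.array([3.0, -4.0])
one_norm = p_norm(v, 1)                  # |3| + |-4| = 7
two_norm = p_norm(v, 2)                  # sqrt(9 + 16) = 5
max_norm = np.max(np.abs(v))             # maximum norm (3.9)

# For p = 1/2 the triangle inequality fails on the standard basis vectors.
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lhs = p_norm(x + y, 0.5)                 # (1 + 1)^2 = 4
rhs = p_norm(x, 0.5) + p_norm(y, 0.5)    # 1 + 1 = 2
print(one_norm, two_norm, max_norm, lhs > rhs)
```

As $p$ increases, `p_norm(v, p)` decreases towards `max_norm`, which is one way to motivate the notation $\|\cdot\|_\infty$.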


There is a natural norm to use for linear functions between two normed
spaces:

Definition 3.5. Given normed spaces $V$ and $W$, the operator norm of a
linear map $A \colon V \to W$ is

\[ \|A\| := \sup_{0 \neq v \in V} \frac{\|Av\|_W}{\|v\|_V}. \]

If $\|A\|$ is finite, then $A$ is called a bounded linear operator. The operator norm
of $A$ will also be denoted $\|A\|_{\mathrm{op}}$ or $\|A\|_{V \to W}$. There are many equivalent
expressions for this norm: see Exercise 3.1.
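For a matrix acting between Euclidean spaces, the operator norm is the largest singular value. The following sketch, assuming NumPy and a hypothetical random matrix, compares that value with the ratio $\|Av\| / \|v\|$ over many random directions and with the ratio attained at the leading right singular vector.

```python
import numpy as np

rng = np.random.default_rng(6)                   # arbitrary seed
A = rng.standard_normal((4, 3))                  # hypothetical map A : R^3 -> R^4

op_norm = np.linalg.norm(A, 2)                   # spectral norm = largest singular value

# The ratio ||Av|| / ||v|| never exceeds the operator norm ...
V = rng.standard_normal((3, 10_000))             # columns are random directions v
ratios = np.linalg.norm(A @ V, axis=0) / np.linalg.norm(V, axis=0)

# ... and the supremum is attained at the leading right singular vector.
_, s, Vt = np.linalg.svd(A)
attained = np.linalg.norm(A @ Vt[0])
print(ratios.max(), attained, op_norm)
```

Random directions typically approach the supremum only slowly as the dimension grows, which is why the singular value decomposition is the practical tool here.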

Definition 3.6. Two inner product spaces $(V, \langle \cdot, \cdot \rangle_V)$ and $(W, \langle \cdot, \cdot \rangle_W)$
are said to be isometrically isomorphic if there is an invertible linear map
$T \colon V \to W$ such that

\[ \langle Tu, Tv \rangle_W = \langle u, v \rangle_V \quad \text{for all } u, v \in V. \]

The two inner product spaces are then 'the same up to relabelling'. Similarly,
two normed spaces are isometrically isomorphic if there is an invertible linear
map that preserves the norm.

Finally, normed spaces are examples of topological spaces, in that the norm 
structure induces a collection of open sets and (as will be revisited in the next 
section) a notion of convergence: 

Definition 3.7. Let $V$ be a normed space:

(a) For $x \in V$ and $r > 0$, the open ball of radius $r$ centred on $x$ is

\[ \mathbb{B}_r(x) := \{ y \in V \mid \|x - y\| < r \} \tag{3.12} \]

and the closed ball of radius $r$ centred on $x$ is

\[ \overline{\mathbb{B}}_r(x) := \{ y \in V \mid \|x - y\| \leq r \}. \tag{3.13} \]

(b) A subset $U \subseteq V$ is called an open set if, for all $x \in U$, there exists
$r = r(x) > 0$ such that $\mathbb{B}_r(x) \subseteq U$.

(c) A subset $F \subseteq V$ is called a closed set if $V \setminus F$ is an open set.


3.2 Banach and Hilbert Spaces 

For the purposes of analysis, rather than pure algebra, it is convenient if
normed spaces are complete in the same way that $\mathbb{R}$ is complete and $\mathbb{Q}$ is
not:

Definition 3.8. Let $(V, \|\cdot\|)$ be a normed space.

(a) A sequence $(x_n)_{n \in \mathbb{N}}$ in $V$ converges to $x \in V$ if, for every $\varepsilon > 0$, there
exists $N \in \mathbb{N}$ such that, whenever $n \geq N$, $\|x_n - x\| < \varepsilon$.

(b) A sequence $(x_n)_{n \in \mathbb{N}}$ in $V$ is called Cauchy if, for every $\varepsilon > 0$, there exists
$N \in \mathbb{N}$ such that, whenever $m, n \geq N$, $\|x_m - x_n\| < \varepsilon$.

(c) A complete space is one in which each Cauchy sequence in $V$ converges
to some element of $V$. Complete normed spaces are called Banach spaces,
and complete inner product spaces are called Hilbert spaces.

It is easily verified that a subset $F$ of a normed space is closed (in the
topological sense of being the complement of an open set) if and only if it is
closed under the operation of taking limits of sequences (i.e. every convergent
sequence in $F$ has its limit also in $F$), and that closed linear subspaces of
Banach (resp. Hilbert) spaces are again Banach (resp. Hilbert) spaces.

Example 3.9. (a) $\mathbb{K}^n$ and $\mathbb{K}^{m \times n}$ are finite-dimensional Hilbert spaces with
respect to their usual inner products.

(b) The standard example of an infinite-dimensional Hilbert space is the
space $\ell^2(\mathbb{K})$ of square-summable $\mathbb{K}$-valued sequences, which is a Hilbert
space with respect to the inner product

\[ \langle x, y \rangle_{\ell^2} := \sum_{n \in \mathbb{N}} \overline{x_n} y_n. \]

This space is the prototypical example of a separable Hilbert space, i.e.
it has a countably infinite dense subset, and hence countably infinite
dimension.

(c) On the other hand, the subspace of $\ell^2$ consisting of all sequences with
only finitely many non-zero terms is a non-closed subspace of $\ell^2$, and not
a Hilbert space. Of course, if the non-zero terms are restricted to lie in a
predetermined finite range of indices, say $\{1, \dots, n\}$, then the subspace
is an isomorphic copy of the Hilbert space $\mathbb{K}^n$.

(d) Given a measure space $(\mathcal{X}, \mathscr{F}, \mu)$, the space $L^2(\mathcal{X}, \mu; \mathbb{K})$ of (equivalence
classes modulo equality $\mu$-almost everywhere of) square-integrable
functions from $\mathcal{X}$ to $\mathbb{K}$ is a Hilbert space with respect to the inner product

\[ \langle f, g \rangle_{L^2(\mu)} := \int_{\mathcal{X}} \overline{f(x)} g(x) \, d\mu(x). \tag{3.14} \]

Note that it is necessary to take the quotient by the equivalence relation
of equality $\mu$-almost everywhere since a function $f$ that vanishes on a set
of full measure but is non-zero on a set of zero measure is not the zero
function but nonetheless has $\|f\|_{L^2(\mu)} = 0$. When $(\mathcal{X}, \mathscr{F}, \mu)$ is a
probability space, elements of $L^2(\mathcal{X}, \mu; \mathbb{K})$ are thought of as random variables
of finite variance and, for mean-zero random variables, the $L^2$ inner product
is the covariance:

\[ \langle X, Y \rangle_{L^2(\mu)} := \mathbb{E}_\mu[\overline{X} Y] = \operatorname{cov}(X, Y). \]

When $L^2(\mathcal{X}, \mu; \mathbb{K})$ is a separable space, it is isometrically isomorphic to
$\ell^2(\mathbb{K})$ (see Theorem 3.24).

(e) Indeed, Hilbert spaces over a fixed field $\mathbb{K}$ are classified by their
dimension: whenever $\mathcal{H}$ and $\mathcal{K}$ are Hilbert spaces of the same dimension over
$\mathbb{K}$, there is an invertible $\mathbb{K}$-linear map $T \colon \mathcal{H} \to \mathcal{K}$ such that $\langle Tx, Ty \rangle_{\mathcal{K}} =
\langle x, y \rangle_{\mathcal{H}}$ for all $x, y \in \mathcal{H}$.




Example 3.10. (a) For a compact topological space $\mathcal{X}$, the space $C^0(\mathcal{X}; \mathbb{K})$
of continuous functions $f \colon \mathcal{X} \to \mathbb{K}$ is a Banach space with respect to the
supremum norm

\[ \|f\|_\infty := \sup_{x \in \mathcal{X}} |f(x)|. \tag{3.15} \]

For non-compact $\mathcal{X}$, the supremum norm is only a bona fide norm if
we restrict attention to bounded continuous functions, since otherwise it
would take the inadmissible value $+\infty$.

(b) More generally, if $\mathcal{X}$ is the compact closure of an open subset of a Banach
space $V$, and $r \in \mathbb{N}_0$, then the space $C^r(\mathcal{X}; \mathbb{K})$ of all $r$-times continuously
differentiable functions from $\mathcal{X}$ to $\mathbb{K}$ is a Banach space with respect to
the norm

\[ \|f\|_{C^r} := \sum_{k=0}^r \|D^k f\|_\infty. \]

Here, $Df(x) \colon V \to \mathbb{K}$ denotes the first-order Fréchet derivative of $f$ at $x$,
the unique bounded linear map such that

\[ \lim_{\substack{y \to x \\ \text{in } \mathcal{X}}} \frac{|f(y) - f(x) - Df(x)(y - x)|}{\|y - x\|} = 0, \]

$D^2 f(x) = D(Df)(x) \colon V \times V \to \mathbb{K}$ denotes the second-order Fréchet
derivative, etc.

(c) For $1 \leq p \leq \infty$, the spaces $L^p(\mathcal{X}, \mu; \mathbb{K})$ from Definition 2.21 are Banach
spaces, but only the $L^2$ spaces are Hilbert spaces. As special cases ($\mathcal{X} =
\mathbb{N}$, and $\mu$ = counting measure), the sequence spaces $\ell^p$ are also Banach
spaces, and are Hilbert if and only if $p = 2$.

Another family of Banach spaces that arises very often in PDE
applications is the family of Sobolev spaces. For the sake of brevity, we limit
the discussion to those Sobolev spaces that are also Hilbert spaces. To
save space, we use multi-index notation for derivatives: for a multi-index
$\alpha := (\alpha_1, \dots, \alpha_n) \in \mathbb{N}_0^n$, with $|\alpha| := \alpha_1 + \dots + \alpha_n$,

\[ \partial^\alpha u(x) := \frac{\partial^{|\alpha|} u}{\partial^{\alpha_1} x_1 \cdots \partial^{\alpha_n} x_n}(x). \]

Sobolev spaces consist of functions¹ that have appropriately integrable weak
derivatives, as defined by integrating by parts against smooth test functions:

¹ To be more precise, as with the Lebesgue $L^p$ spaces, Sobolev spaces consist of equivalence
classes of such functions, with equivalence being equality almost everywhere.

Definition 3.11. Let $\mathcal{X} \subseteq \mathbb{R}^n$, let $\alpha \in \mathbb{N}_0^n$, and consider $u \colon \mathcal{X} \to \mathbb{R}$. A weak
derivative of order $\alpha$ for $u$ is a function $v \colon \mathcal{X} \to \mathbb{R}$ such that

\[ \int_{\mathcal{X}} u(x) \, \partial^\alpha \phi(x) \, dx = (-1)^{|\alpha|} \int_{\mathcal{X}} v(x) \phi(x) \, dx \tag{3.16} \]

for every smooth function $\phi \colon \mathcal{X} \to \mathbb{R}$ that vanishes outside a compact subset
$\operatorname{supp}(\phi) \subseteq \mathcal{X}$. Such a weak derivative is usually denoted $\partial^\alpha u$ as if it were a
strong derivative, and indeed coincides with the classical (strong) derivative
if the latter exists. For $s \in \mathbb{N}_0$, the Sobolev space $H^s(\mathcal{X})$ is

\[ H^s(\mathcal{X}) := \left\{ u \in L^2(\mathcal{X}) \,\middle|\, \begin{array}{l} \text{for all } \alpha \in \mathbb{N}_0^n \text{ with } |\alpha| \leq s, \\ u \text{ has a weak derivative } \partial^\alpha u \in L^2(\mathcal{X}) \end{array} \right\} \tag{3.17} \]

with the inner product

\[ \langle u, v \rangle_{H^s} := \sum_{|\alpha| \leq s} \langle \partial^\alpha u, \partial^\alpha v \rangle_{L^2}. \tag{3.18} \]
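The defining identity (3.16) can be tested numerically in one dimension. The sketch below assumes NumPy and takes, as an illustrative example that is not from the text, $u(x) = |x|$ on $(-1, 1)$ with candidate weak derivative $v(x) = \operatorname{sign}(x)$; the bump function $\phi$, its centre `c` and radius `r` are hypothetical choices, placed so that the support of $\phi$ straddles the kink of $u$ at the origin.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule, written out to avoid version-specific NumPy names."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

x = np.linspace(-1.0, 1.0, 200_001)
u = np.abs(x)                        # not classically differentiable at 0
v = np.sign(x)                       # candidate weak derivative of u

# Smooth test function phi(x) = exp(-1 / (1 - t^2)) for |t| < 1, t = (x - c) / r,
# extended by zero; its support [c - r, c + r] is a compact subset of (-1, 1).
c, r = 0.2, 0.5
t = (x - c) / r
inside = np.abs(t) < 1.0
phi = np.zeros_like(x)
dphi = np.zeros_like(x)
phi[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
# Chain rule: phi'(x) = phi(x) * (-2 t / (1 - t^2)^2) / r.
dphi[inside] = phi[inside] * (-2.0 * t[inside] / (1.0 - t[inside] ** 2) ** 2) / r

lhs = trapezoid(u * dphi, x)         # integral of u * phi'
rhs = -trapezoid(v * phi, x)         # (-1)^1 times the integral of v * phi
print(lhs, rhs)
```

A single test function proves nothing, of course; (3.16) must hold for every admissible $\phi$, but an off-centre bump like this one is enough to catch an incorrect candidate $v$.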


The following result shows that smoothness in the Sobolev sense implies
either a greater degree of integrability or even Hölder continuity. In
particular, possibly after modification on sets of Lebesgue measure zero, Sobolev
functions in $H^s$ are continuous when $s > n/2$. Thus, such functions can be
considered to have well-defined pointwise values.

Theorem 3.12 (Sobolev embedding theorem). Let $\mathcal{X} \subseteq \mathbb{R}^n$ be a Lipschitz
domain (i.e. a connected set with non-empty interior, such that $\partial \mathcal{X}$
can always be locally written as the graph of a Lipschitz function of $n - 1$
variables).

(a) If $s < n/2$, then $H^s(\mathcal{X}) \subseteq L^q(\mathcal{X})$, where $\frac{1}{q} = \frac{1}{2} - \frac{s}{n}$, and there is a
constant $C = C(s, n, \mathcal{X})$ such that

\[ \|u\|_{L^q(\mathcal{X})} \leq C \|u\|_{H^s(\mathcal{X})} \quad \text{for all } u \in H^s(\mathcal{X}). \]

(b) If $s > n/2$, then $H^s(\mathcal{X}) \subseteq C^{s - \lfloor n/2 \rfloor - 1, \gamma}(\mathcal{X})$, where

\[ \gamma = \begin{cases} \lfloor n/2 \rfloor + 1 - n/2, & \text{if } n \text{ is odd,} \\ \text{any element of } (0, 1), & \text{if } n \text{ is even,} \end{cases} \]

and there is a constant $C = C(s, n, \gamma, \mathcal{X})$ such that

\[ \|u\|_{C^{s - \lfloor n/2 \rfloor - 1, \gamma}(\mathcal{X})} \leq C \|u\|_{H^s(\mathcal{X})} \quad \text{for all } u \in H^s(\mathcal{X}), \]

where the Hölder norm is defined (up to equivalence) by

\[ \|u\|_{C^{k, \gamma}(\mathcal{X})} := \|u\|_{C^k} + \sup_{\substack{x, y \in \mathcal{X} \\ x \neq y}} \frac{|D^k u(x) - D^k u(y)|}{|x - y|^\gamma}. \]

Dual Spaces. Many interesting properties of a vector space are encoded
in a second vector space whose elements are the linear functions from the
first space to its field. When the vector space is a normed space,² so that
concepts like continuity are defined, it makes sense to study continuous linear
functions:

Definition 3.13. The continuous dual space of a normed space $V$ over $\mathbb{K}$ is
the vector space $V'$ of all bounded (equivalently, continuous) linear functionals
$\ell \colon V \to \mathbb{K}$. The dual pairing between an element $\ell \in V'$ and an element $v \in V$
is denoted $\langle \ell \mid v \rangle$ or simply $\ell(v)$. For a linear functional $\ell$ on a seminormed
space $V$, being continuous is equivalent to being bounded in the sense that
its operator norm (or dual norm)

\[ \|\ell\|' := \sup_{0 \neq v \in V} \frac{|\langle \ell \mid v \rangle|}{\|v\|} \equiv \sup_{\substack{v \in V \\ \|v\| = 1}} |\langle \ell \mid v \rangle| \equiv \sup_{\substack{v \in V \\ \|v\| \leq 1}} |\langle \ell \mid v \rangle| \]

is finite.

Proposition 3.14. For every normed space $V$, the dual space $V'$ is a Banach
space with respect to $\|\cdot\|'$.

An important property of Hilbert spaces is that they are naturally self- 
dual : every continuous linear functional on a Hilbert space can be naturally 
identified with the action of taking the inner product with some element of 
the space: 

Theorem 3.15 (Riesz representation theorem). Let $\mathcal{H}$ be a Hilbert space.
For every continuous linear functional $f \in \mathcal{H}'$, there exists $f^\sharp \in \mathcal{H}$ such that
$\langle f \mid x \rangle = \langle f^\sharp, x \rangle$ for all $x \in \mathcal{H}$. Furthermore, the map $f \mapsto f^\sharp$ is an isometric
isomorphism between $\mathcal{H}$ and its dual.

The simplicity of the Riesz representation theorem for duals of Hilbert 
spaces stands in stark contrast to the duals of even elementary Banach spaces, 
which are identified on a more case-by-case basis: 

• For $1 < p < \infty$, $L^p(X, \mu)$ is isometrically isomorphic to the dual of
$L^q(X, \mu)$, where $\frac{1}{p} + \frac{1}{q} = 1$. This result applies to the sequence space $\ell^p$,
and indeed to the finite-dimensional Banach spaces $\mathbb{R}^n$ and $\mathbb{C}^n$ with the
norm $\|x\|_p := \bigl( \sum_{i=1}^n |x_i|^p \bigr)^{1/p}$.

• By the Riesz-Markov-Kakutani representation theorem, the dual of the
Banach space $C_c(X)$ of compactly supported continuous functions on a
locally compact Hausdorff space $X$ is isomorphic to the space of regular
signed measures on $X$.


2 Or even just a topological vector space.

The second example stands as another piece of motivation for measure theory
in general and signed measures in particular. Readers interested in the details 
of these constructions should refer to a specialist text on functional analysis. 

Adjoint Maps. Given a linear map $A\colon V \to W$ between normed spaces $V$
and $W$, the adjoint of $A$ is the linear map $A^*\colon W' \to V'$ defined by

$$ \langle A^* \ell \mid v \rangle = \langle \ell \mid A v \rangle \quad \text{for all } v \in V \text{ and } \ell \in W'. $$

The following properties of adjoint maps are fundamental:

Proposition 3.16. Let $U$, $V$ and $W$ be normed spaces, let $A, B\colon V \to W$
and $C\colon U \to V$ be bounded linear maps, and let $\alpha$ and $\beta$ be scalars. Then

(a) $A^*\colon W' \to V'$ is bounded, with operator norm $\|A^*\| = \|A\|$;

(b) $(\alpha A + \beta B)^* = \bar{\alpha} A^* + \bar{\beta} B^*$;

(c) $(AC)^* = C^* A^*$;

(d) the kernel and range of $A$ and $A^*$ satisfy

$$ \ker A^* = (\operatorname{ran} A)^\perp := \{ \ell \in W' \mid \langle \ell \mid A v \rangle = 0 \text{ for all } v \in V \}, $$

$$ (\ker A^*)^\perp = \overline{\operatorname{ran} A}. $$

When considering a linear map $A\colon \mathcal{H} \to \mathcal{K}$ between Hilbert spaces $\mathcal{H}$ and
$\mathcal{K}$, we can appeal to the Riesz representation theorem to identify $\mathcal{H}'$ with $\mathcal{H}$,
$\mathcal{K}'$ with $\mathcal{K}$, and hence define the adjoint in terms of inner products:

$$ \langle A^* k, h \rangle_{\mathcal{H}} = \langle k, A h \rangle_{\mathcal{K}} \quad \text{for all } h \in \mathcal{H} \text{ and } k \in \mathcal{K}. $$

With this simplification, we can add to Proposition 3.16 the additional
properties that $A^{**} = A$ and $\|A^* A\| = \|A A^*\| = \|A\|^2$. Also, in the Hilbert
space setting, a linear map $A\colon \mathcal{H} \to \mathcal{H}$ is said to be self-adjoint if $A = A^*$.
A self-adjoint map $A$ is said to be positive semi-definite if

$$ \inf_{\substack{x \in \mathcal{H} \\ x \neq 0}} \frac{\langle x, A x \rangle}{\|x\|^2} \geq 0, $$

and positive definite if this inequality is strict.

Given a basis $\{e_i\}_{i \in I}$ of $\mathcal{H}$, the corresponding dual basis $\{e^i\}_{i \in I}$ of $\mathcal{H}$
is defined by the relation $\langle e^i, e_j \rangle_{\mathcal{H}} = \delta_{ij}$. The matrix of $A$ with respect to
bases $\{e_i\}_{i \in I}$ of $\mathcal{H}$ and $\{f_j\}_{j \in J}$ of $\mathcal{K}$ and the matrix of $A^*$ with respect to
the corresponding dual bases are very simply related: the one is the conjugate
transpose of the other, and so by abuse of terminology the conjugate
transpose of a matrix is often referred to as the adjoint.

Thus, self-adjoint bounded linear maps are the appropriate generalization
to Hilbert spaces of symmetric matrices over $\mathbb{R}$ or Hermitian matrices over
$\mathbb{C}$. They are also particularly useful in probability because the covariance
operator of an $\mathcal{H}$-valued random variable is a self-adjoint (and indeed positive
semi-definite) bounded linear operator on $\mathcal{H}$.

Orthogonal decompositions of Hilbert spaces will be fundamental tools in
many of the methods considered later on. 


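Definition 3.17. A subset $E$ of an inner product space $V$ is said to be
orthogonal if $\langle x, y \rangle = 0$ for all distinct elements $x, y \in E$; it is said to be
orthonormal if

$$ \langle x, y \rangle = \begin{cases} 1, & \text{if } x = y \in E, \\ 0, & \text{if } x, y \in E \text{ and } x \neq y. \end{cases} $$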


Lemma 3.18 (Gram-Schmidt). Let $(x_n)_{n \in \mathbb{N}}$ be any sequence in an inner
product space $V$, with the first $d \in \mathbb{N}_0 \cup \{\infty\}$ terms linearly independent.
Inductively define $(u_n)_{n \in \mathbb{N}}$ and $(e_n)_{n \in \mathbb{N}}$ by

$$ u_n := x_n - \sum_{k=1}^{n-1} \frac{\langle x_n, u_k \rangle}{\|u_k\|^2} u_k, \qquad e_n := \frac{u_n}{\|u_n\|}. $$

Then $(u_n)_{n \in \mathbb{N}}$ (resp. $(e_n)_{n \in \mathbb{N}}$) is a sequence of $d$ orthogonal (resp. orthonormal)
elements of $V$, followed by zeros if $d < \infty$.
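
The recursion in Lemma 3.18 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the space is taken to be real Euclidean space with vectors stored as lists, and the names `inner`, `gram_schmidt` and the tolerance `tol` are hypothetical helpers introduced here. A vector that depends linearly on its predecessors produces a zero $u_n$, and the sketch then sets $e_n$ to zero as well, matching the "followed by zeros" convention of the lemma.

```python
import math


def inner(x, y):
    """Euclidean inner product on R^n (real case only)."""
    return sum(a * b for a, b in zip(x, y))


def gram_schmidt(xs, tol=1e-12):
    """Return (us, es): the orthogonal and orthonormal sequences of Lemma 3.18.

    `tol` is an assumed numerical threshold below which a vector is
    treated as zero; it is not part of the lemma itself.
    """
    us, es = [], []
    for x in xs:
        u = list(x)
        for uk in us:
            nk2 = inner(uk, uk)
            if nk2 > tol:  # zero vectors contribute nothing to the sum
                c = inner(x, uk) / nk2
                u = [ui - c * uki for ui, uki in zip(u, uk)]
        norm_u = math.sqrt(inner(u, u))
        us.append(u)
        es.append([ui / norm_u for ui in u] if norm_u > tol else [0.0] * len(u))
    return us, es


xs = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
us, es = gram_schmidt(xs)
```

The update uses the original vector $x_n$ in each coefficient, exactly as in the displayed formula; numerically more stable variants exist but are not shown here.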

Definition 3.19. The orthogonal complement $E^\perp$ of a subset $E$ of an inner
product space $V$ is

$$ E^\perp := \{ y \in V \mid \text{for all } x \in E, \ \langle y, x \rangle = 0 \}. $$

The orthogonal complement of $E \subseteq V$ is always a closed linear subspace
of $V$, and hence if $V = \mathcal{H}$ is a Hilbert space, then $E^\perp$ is also a Hilbert space
in its own right.

Theorem 3.20. Let $\mathcal{K}$ be a closed subspace of a Hilbert space $\mathcal{H}$. Then, for
any $x \in \mathcal{H}$, there is a unique $\Pi_{\mathcal{K}} x \in \mathcal{K}$ that is closest to $x$ in the sense that

$$ \|\Pi_{\mathcal{K}} x - x\| = \inf_{y \in \mathcal{K}} \|y - x\|. $$

Furthermore, $x$ can be written uniquely as $x = \Pi_{\mathcal{K}} x + z$, where $z \in \mathcal{K}^\perp$.
Hence, $\mathcal{H}$ decomposes as the orthogonal direct sum

$$ \mathcal{H} = \mathcal{K} \overset{\perp}{\oplus} \mathcal{K}^\perp. $$

Theorem 3.20 can be seen as a special case of closest-point approximation
among convex sets: see Lemma 4.25 and Exercise 4.2. The operator
$\Pi_{\mathcal{K}}\colon \mathcal{H} \to \mathcal{K}$ is called the orthogonal projection onto $\mathcal{K}$.

Theorem 3.21. Let $\mathcal{K}$ be a closed subspace of a Hilbert space $\mathcal{H}$. The
corresponding orthogonal projection operator $\Pi_{\mathcal{K}}$ is

(a) a continuous linear operator of norm at most 1;

(b) with $I - \Pi_{\mathcal{K}} = \Pi_{\mathcal{K}^\perp}$;

and satisfies, for every $x \in \mathcal{H}$,

(c) $\|x\|^2 = \|\Pi_{\mathcal{K}} x\|^2 + \|(I - \Pi_{\mathcal{K}}) x\|^2$;

(d) $\Pi_{\mathcal{K}} x = x \iff x \in \mathcal{K}$;

(e) $\Pi_{\mathcal{K}} x = 0 \iff x \in \mathcal{K}^\perp$.
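
As a concrete check of properties (c) to (e), the following sketch projects a vector of $\mathbb{R}^3$ onto a coordinate plane. It assumes that the closed subspace is given through an orthonormal basis, in which case the projection is the sum of the components along that basis; the helper names `inner` and `project` are hypothetical and chosen only for this illustration.

```python
def inner(x, y):
    """Euclidean inner product on R^n (real case only)."""
    return sum(a * b for a, b in zip(x, y))


def project(x, onb):
    """Orthogonal projection of x onto span(onb), where onb is orthonormal."""
    p = [0.0] * len(x)
    for e in onb:
        c = inner(e, x)  # component of x along e
        p = [pi + c * ei for pi, ei in zip(p, e)]
    return p


# K is the xy-plane in R^3, spanned by an orthonormal pair.
onb_K = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x = [3.0, -4.0, 12.0]

Px = project(x, onb_K)                    # component of x in K
Qx = [xi - pi for xi, pi in zip(x, Px)]   # (I - Pi_K) x, which lies in K-perp
```

Here $\|x\|^2 = 169$ splits as $25 + 144$, which is property (c); projecting `Px` again leaves it unchanged (property (d)), and any multiple of the third coordinate vector projects to zero (property (e)).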

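Example 3.22 (Conditional expectation). An important probabilistic application
of orthogonal projection is the operation of conditioning a random
variable. Let $(\Theta, \mathscr{F}, \mu)$ be a probability space and let $X \in L^2(\Theta, \mathscr{F}, \mu; \mathbb{K})$
be a square-integrable random variable. If $\mathscr{G} \subseteq \mathscr{F}$ is a σ-algebra, then the
conditional expectation of $X$ with respect to $\mathscr{G}$, usually denoted $\mathbb{E}[X \mid \mathscr{G}]$, is the
orthogonal projection of $X$ onto the subspace $L^2(\Theta, \mathscr{G}, \mu; \mathbb{K})$. In elementary
contexts, $\mathscr{G}$ is usually taken to be the σ-algebra generated by a single event
$E$ of positive μ-probability, i.e.

$$ \mathscr{G} = \{ \varnothing, [X \in E], [X \notin E], \Theta \}; $$

or even the trivial σ-algebra $\{\varnothing, \Theta\}$, for which the only measurable functions
are the constant functions, and hence the conditional expectation coincides
with the usual expectation. The orthogonal projection point of view makes
two important properties of conditional expectation intuitively obvious:

(a) Whenever $\mathscr{G}_1 \subseteq \mathscr{G}_2 \subseteq \mathscr{F}$, $L^2(\Theta, \mathscr{G}_1, \mu; \mathbb{K})$ is a subspace of $L^2(\Theta, \mathscr{G}_2, \mu; \mathbb{K})$
and composition of the orthogonal projections onto these subspaces yields
the tower rule for conditional expectations:

$$ \mathbb{E}[X \mid \mathscr{G}_1] = \mathbb{E}\bigl[ \mathbb{E}[X \mid \mathscr{G}_2] \bigm| \mathscr{G}_1 \bigr], $$

and, in particular, taking $\mathscr{G}_1$ to be the trivial σ-algebra $\{\varnothing, \Theta\}$,

$$ \mathbb{E}[X] = \mathbb{E}\bigl[ \mathbb{E}[X \mid \mathscr{G}_2] \bigr]. $$

(b) Whenever $X, Y \in L^2(\Theta, \mathscr{F}, \mu; \mathbb{K})$ and $X$ is, in fact, $\mathscr{G}$-measurable,

$$ \mathbb{E}[X Y \mid \mathscr{G}] = X \, \mathbb{E}[Y \mid \mathscr{G}]. $$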
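
On a finite sample space both properties can be verified by direct computation. The sketch below assumes a six-point sample space with uniform probabilities and σ-algebras generated by finite partitions, so that conditional expectation is the probability-weighted average over each block; the function name `cond_exp` and the particular numbers are hypothetical. Exact rational arithmetic is used so that the identities hold without rounding error.

```python
from fractions import Fraction


def cond_exp(X, prob, partition):
    """E[X | G], where G is generated by a finite partition of the sample space.

    X and prob are lists indexed by outcome; the result is again a list,
    constant on each block of the partition.
    """
    out = [None] * len(X)
    for block in partition:
        mass = sum(prob[w] for w in block)
        mean = sum(prob[w] * X[w] for w in block) / mass
        for w in block:
            out[w] = mean
    return out


prob = [Fraction(1, 6)] * 6
X = [Fraction(v) for v in (1, 4, 2, 8, 5, 7)]

fine = [[0], [1, 2], [3, 4], [5]]        # generates the larger sigma-algebra G_2
coarse = [[0, 1, 2], [3, 4, 5]]          # generates G_1, contained in G_2
trivial = [[0, 1, 2, 3, 4, 5]]           # the trivial sigma-algebra

# Tower rule: conditioning on G_2 and then on G_1 equals conditioning on G_1.
lhs = cond_exp(X, prob, coarse)
rhs = cond_exp(cond_exp(X, prob, fine), prob, coarse)

# "Taking out what is known": Z is G_1-measurable (constant on coarse blocks).
Z = [Fraction(v) for v in (2, 2, 2, 5, 5, 5)]
ZX = [z * x for z, x in zip(Z, X)]
```

Conditioning on the trivial partition returns the constant function equal to the ordinary expectation, which is the special case of the tower rule stated in (a).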


Direct Sums. Suppose that V and W are vector spaces over a common field 
K. The Cartesian product V x W can be given the structure of a vector space 
over K by defining the operations componentwise: 


$$ (v, w) + (v', w') := (v + v', w + w'), \qquad \alpha (v, w) := (\alpha v, \alpha w), $$

for all $v, v' \in V$, $w, w' \in W$, and $\alpha \in \mathbb{K}$. The resulting vector space is called
the (algebraic) direct sum of $V$ and $W$ and is usually denoted by $V \oplus W$,
while elements of $V \oplus W$ are usually denoted by $v \oplus w$ instead of $(v, w)$.

If $\{e_i \mid i \in I\}$ is a basis of $V$ and $\{e_j \mid j \in J\}$ is a basis of $W$, then
$\{e_k \mid k \in K := I \uplus J\}$ is a basis of $V \oplus W$. Hence, the dimension of $V \oplus W$ over $\mathbb{K}$ is
equal to the sum of the dimensions of $V$ and $W$.

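When $\mathcal{H}$ and $\mathcal{K}$ are Hilbert spaces, their (algebraic) direct sum $\mathcal{H} \oplus \mathcal{K}$ can
be given a Hilbert space structure by defining

$$ \langle h \oplus k, h' \oplus k' \rangle_{\mathcal{H} \oplus \mathcal{K}} := \langle h, h' \rangle_{\mathcal{H}} + \langle k, k' \rangle_{\mathcal{K}} $$

for all $h, h' \in \mathcal{H}$ and $k, k' \in \mathcal{K}$. The original spaces $\mathcal{H}$ and $\mathcal{K}$ embed into
$\mathcal{H} \oplus \mathcal{K}$ as the subspaces $\mathcal{H} \oplus \{0\}$ and $\{0\} \oplus \mathcal{K}$ respectively, and these two
subspaces are mutually orthogonal. For this reason, the orthogonality of the
two summands in a Hilbert direct sum is sometimes emphasized by the notation
$\mathcal{H} \overset{\perp}{\oplus} \mathcal{K}$. The Hilbert space projection theorem (Theorem 3.20) was
the statement that whenever $\mathcal{K}$ is a closed subspace of a Hilbert space $\mathcal{H}$,
$\mathcal{H} = \mathcal{K} \overset{\perp}{\oplus} \mathcal{K}^\perp$.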

It is necessary to be a bit more careful in defining the direct sum of countably
many Hilbert spaces. Let $\mathcal{H}_n$ be a Hilbert space over $\mathbb{K}$ for each $n \in \mathbb{N}$.
Then the Hilbert space direct sum $\mathcal{H} := \bigoplus_{n \in \mathbb{N}} \mathcal{H}_n$ is defined to be

$$ \mathcal{H} := \overline{ \left\{ x = (x_n)_{n \in \mathbb{N}} \,\middle|\, \begin{array}{l} x_n \in \mathcal{H}_n \text{ for each } n \in \mathbb{N}, \text{ and} \\ x_n = 0 \text{ for all but finitely many } n \end{array} \right\} }, $$

where the completion³ is taken with respect to the inner product

$$ \langle x, y \rangle_{\mathcal{H}} := \sum_{n \in \mathbb{N}} \langle x_n, y_n \rangle_{\mathcal{H}_n}, $$

which is always a finite sum when applied to elements of the generating
set. This construction ensures that every element $x$ of $\mathcal{H}$ has finite norm
$\|x\|_{\mathcal{H}}^2 = \sum_{n \in \mathbb{N}} \|x_n\|_{\mathcal{H}_n}^2$. As before, each of the summands $\mathcal{H}_n$ is a subspace
of $\mathcal{H}$ that is orthogonal to all the others.

Orthogonal direct sums and orthogonal bases are among the most important
constructions in Hilbert space theory, and will be very useful in what
follows. Prototypical examples include the standard ‘Euclidean’ basis of $\ell^2$
and the Fourier basis $\{e_n \mid n \in \mathbb{Z}\}$ of $L^2(\mathbb{S}^1; \mathbb{C})$, where

$$ e_n(x) := \frac{1}{\sqrt{2\pi}} \exp(i n x). $$



3 Completions of normed spaces are formed in the same way as the completion of Q to form 
R: the completion is the space of equivalence classes of Cauchy sequences, with sequences 
whose difference tends to zero in norm being regarded as equivalent. 

Indeed, Fourier’s claim⁴ that any periodic function $f$ could be written as

$$ f(x) = \sum_{n \in \mathbb{Z}} \hat{f}_n e_n(x), \qquad \hat{f}_n := \int_{\mathbb{S}^1} f(y) \overline{e_n(y)} \, \mathrm{d}y, $$

can be seen as one of the historical drivers behind the development of much
of analysis. For the purposes of this book’s treatment of UQ, key examples
of orthogonal bases are given by orthogonal polynomials, which will be
considered at length in Chapter 8.

Some important results about orthogonal systems are summarized below; 
classically, many of these results arose in the study of Fourier series, but hold 
for any orthonormal basis of a general Hilbert space. 

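Lemma 3.23 (Bessel’s inequality). Let $V$ be an inner product space and
$(e_n)_{n \in \mathbb{N}}$ an orthonormal sequence in $V$. Then, for any $x \in V$, the series
$\sum_{n \in \mathbb{N}} |\langle e_n, x \rangle|^2$ converges and satisfies

$$ \sum_{n \in \mathbb{N}} |\langle e_n, x \rangle|^2 \leq \|x\|^2. \qquad (3.19) $$

Theorem 3.24 (Parseval identity). Let $(e_n)_{n \in \mathbb{N}}$ be an orthonormal sequence
in a Hilbert space $\mathcal{H}$, and let $(\alpha_n)_{n \in \mathbb{N}}$ be a sequence in $\mathbb{K}$. Then the series
$\sum_{n \in \mathbb{N}} \alpha_n e_n$ converges in $\mathcal{H}$ if and only if the series $\sum_{n \in \mathbb{N}} |\alpha_n|^2$ converges in
$\mathbb{R}$, in which case

$$ \Bigl\| \sum_{n \in \mathbb{N}} \alpha_n e_n \Bigr\|^2 = \sum_{n \in \mathbb{N}} |\alpha_n|^2. \qquad (3.20) $$

Hence, for any $x \in \mathcal{H}$, the series $\sum_{n \in \mathbb{N}} \langle e_n, x \rangle e_n$ converges.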

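Theorem 3.25. Let $(e_n)_{n \in \mathbb{N}}$ be an orthonormal sequence in a Hilbert space
$\mathcal{H}$. Then the following are equivalent:

(a) $\{e_n \mid n \in \mathbb{N}\}^\perp = \{0\}$;

(b) $\mathcal{H} = \overline{\operatorname{span}}\{e_n \mid n \in \mathbb{N}\}$;

(c) $\mathcal{H} = \bigoplus_{n \in \mathbb{N}} \mathbb{K} e_n$ as a direct sum of Hilbert spaces;

(d) for all $x \in \mathcal{H}$, $\|x\|^2 = \sum_{n \in \mathbb{N}} |\langle e_n, x \rangle|^2$;

(e) for all $x \in \mathcal{H}$, $x = \sum_{n \in \mathbb{N}} \langle e_n, x \rangle e_n$.

If one (and hence all) of these conditions holds true, then $(e_n)_{n \in \mathbb{N}}$ is called a
complete orthonormal basis for $\mathcal{H}$.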


4 Of course, Fourier did not use the modern notation of Hilbert spaces! Furthermore, if he 
had, then it would have been ‘obvious’ that his claim could only hold true for L 2 functions 
and in the L 2 sense, not pointwise for arbitrary functions. 
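
Corollary 3.26. Let $(e_n)_{n \in \mathbb{N}}$ be a complete orthonormal basis for a Hilbert
space $\mathcal{H}$. For every $x \in \mathcal{H}$, the truncation error $x - \sum_{n=1}^{N} \langle e_n, x \rangle e_n$ is
orthogonal to $\operatorname{span}\{e_1, \dots, e_N\}$.

Proof. Let $v := \sum_{m=1}^{N} v_m e_m \in \operatorname{span}\{e_1, \dots, e_N\}$ be arbitrary. By completeness,

$$ x = \sum_{n \in \mathbb{N}} \langle e_n, x \rangle e_n. $$

Hence,

$$ \Bigl\langle x - \sum_{n=1}^{N} \langle e_n, x \rangle e_n, v \Bigr\rangle
 = \Bigl\langle \sum_{n > N} \langle e_n, x \rangle e_n, \sum_{m=1}^{N} v_m e_m \Bigr\rangle
 = \sum_{n > N} \Bigl\langle \langle e_n, x \rangle e_n, \sum_{m=1}^{N} v_m e_m \Bigr\rangle
 = \sum_{\substack{n > N \\ m \in \{1, \dots, N\}}} \langle x, e_n \rangle v_m \langle e_n, e_m \rangle
 = 0 $$

since $\langle e_n, e_m \rangle = \delta_{nm}$, and $m \neq n$ in the double sum. □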

Remark 3.27. The results cited above (in particular, Theorems 3.20, 3.21,
and 3.25, and Corollary 3.26) imply that if we wish to find the closest point of
$\operatorname{span}\{e_1, \dots, e_N\}$ to some $x = \sum_{n \in \mathbb{N}} \langle e_n, x \rangle e_n$, then this is a simple matter of
series truncation: the optimal approximation is $x \approx x^{(N)} := \sum_{n=1}^{N} \langle e_n, x \rangle e_n$.
Furthermore, this operation is a continuous linear operation as a function of
$x$, and if it is desired to improve the quality of an approximation $x \approx x^{(N)}$ in
$\operatorname{span}\{e_1, \dots, e_N\}$ to an approximation in, say, $\operatorname{span}\{e_1, \dots, e_{N+1}\}$, then the
improvement is a simple matter of calculating $\langle e_{N+1}, x \rangle$ and adjoining the
new term $\langle e_{N+1}, x \rangle e_{N+1}$ to form a new norm-optimal approximation

$$ x \approx x^{(N+1)} := \sum_{n=1}^{N+1} \langle e_n, x \rangle e_n = x^{(N)} + \langle e_{N+1}, x \rangle e_{N+1}. $$

However, in Banach spaces (even finite-dimensional ones), closest-point
approximation is not as simple as series truncation, and the improvement of
approximations is not as simple as adjoining new terms: see Exercise 3.4.
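
The incremental nature of series truncation described in Remark 3.27 can be illustrated in $\mathbb{R}^3$. The sketch below is a toy example under stated assumptions: the orthonormal basis, the vector `x`, and the helper names `inner`, `dist` and `truncations` are all hypothetical. Each new approximation is obtained from the previous one by adjoining a single term, and the approximation error can only decrease.

```python
import math


def inner(x, y):
    """Euclidean inner product on R^n (real case only)."""
    return sum(a * b for a, b in zip(x, y))


def dist(x, y):
    """Euclidean distance between x and y."""
    d = [a - b for a, b in zip(x, y)]
    return math.sqrt(inner(d, d))


def truncations(x, basis):
    """Yield x^(1), x^(2), ..., adjoining one term <e_n, x> e_n at a time."""
    approx = [0.0] * len(x)
    for e in basis:
        c = inner(e, x)
        approx = [a + c * ei for a, ei in zip(approx, e)]
        yield list(approx)


r = 1.0 / math.sqrt(2.0)
basis = [[r, r, 0.0], [r, -r, 0.0], [0.0, 0.0, 1.0]]  # orthonormal in R^3
x = [2.0, 0.0, 1.0]

errors = [dist(x, a) for a in truncations(x, basis)]
```

For this choice the successive errors are $\sqrt{3}$, $1$ and $0$ up to rounding. No earlier coefficient is recomputed when a new basis vector is added, which is precisely what fails in a general Banach space.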

3.5 Tensor Products

The heuristic definition of the tensor product $V \otimes W$ of two vector spaces $V$
and $W$ over a common field $\mathbb{K}$ is that it is the vector space over $\mathbb{K}$ with basis
given by the formal symbols $\{e_i \otimes f_j \mid i \in I, j \in J\}$, where $\{e_i \mid i \in I\}$ is a
basis of $V$ and $\{f_j \mid j \in J\}$ is a basis of $W$. Alternatively, we might say that
elements of $V \otimes W$ are elements of $W$ with $V$-valued rather than $\mathbb{K}$-valued
coefficients (or elements of $V$ with $W$-valued coefficients). However, it is not
immediately clear that this definition is independent of the bases chosen for
$V$ and $W$. A more thorough definition is as follows.

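Definition 3.28. The free vector space $F_{V \times W}$ on the Cartesian product
$V \times W$ is defined by taking the vector space in which the elements of $V \times W$
are a basis:

$$ F_{V \times W} := \left\{ \sum_{i=1}^{n} \alpha_i e_{(v_i, w_i)} \,\middle|\, \begin{array}{l} n \in \mathbb{N} \text{ and, for } i = 1, \dots, n, \\ \alpha_i \in \mathbb{K}, \ (v_i, w_i) \in V \times W \end{array} \right\}. $$

The ‘freeness’ of $F_{V \times W}$ is that the elements $e_{(v,w)}$ are, by definition, linearly
independent for distinct pairs $(v, w) \in V \times W$; even $e_{(v,0)}$ and $e_{(-v,0)}$ are
linearly independent. Now define an equivalence relation $\sim$ on $F_{V \times W}$ such
that

$$ e_{(v + v', w)} \sim e_{(v, w)} + e_{(v', w)}, $$
$$ e_{(v, w + w')} \sim e_{(v, w)} + e_{(v, w')}, $$
$$ \alpha e_{(v, w)} \sim e_{(\alpha v, w)} \sim e_{(v, \alpha w)} $$

for arbitrary $v, v' \in V$, $w, w' \in W$, and $\alpha \in \mathbb{K}$. Let $R$ be the subspace of
$F_{V \times W}$ generated by these equivalence relations, i.e. the equivalence class of
$e_{(0,0)}$.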

Definition 3.29. The (algebraic) tensor product $V \otimes W$ is the quotient space

$$ V \otimes W := \frac{F_{V \times W}}{R}. $$

One can easily check that $V \otimes W$, as defined in this way, is indeed a
vector space over $\mathbb{K}$. The subspace $R$ of $F_{V \times W}$ is mapped to the zero element
of $V \otimes W$ under the quotient map, and so the above equivalences become
equalities in the tensor product space:

$$ (v + v') \otimes w = v \otimes w + v' \otimes w, $$
$$ v \otimes (w + w') = v \otimes w + v \otimes w', $$
$$ \alpha (v \otimes w) = (\alpha v) \otimes w = v \otimes (\alpha w) $$

for all $v, v' \in V$, $w, w' \in W$, and $\alpha \in \mathbb{K}$.

One can also check that the heuristic definition in terms of bases holds
true under the formal definition: if $\{e_i \mid i \in I\}$ is a basis of $V$ and $\{f_j \mid j \in J\}$
is a basis of $W$, then $\{e_i \otimes f_j \mid i \in I, j \in J\}$ is a basis of $V \otimes W$. Hence, the
dimension of the tensor product is the product of dimensions of the original
spaces.

Definition 3.30. The Hilbert space tensor product of two Hilbert spaces $\mathcal{H}$
and $\mathcal{K}$ over the same field $\mathbb{K}$ is given by defining an inner product on the
algebraic tensor product $\mathcal{H} \otimes \mathcal{K}$ by

$$ \langle h \otimes k, h' \otimes k' \rangle_{\mathcal{H} \otimes \mathcal{K}} := \langle h, h' \rangle_{\mathcal{H}} \langle k, k' \rangle_{\mathcal{K}} \quad \text{for all } h, h' \in \mathcal{H} \text{ and } k, k' \in \mathcal{K}, $$

extending this definition to all of the algebraic tensor product by sesquilinearity,
and defining the Hilbert space tensor product $\mathcal{H} \otimes \mathcal{K}$ to be the completion
of the algebraic tensor product with respect to this inner product and its
associated norm.
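
In finite dimensions these definitions reduce to familiar array operations. The following sketch assumes real coordinate spaces and represents an elementary tensor $h \otimes k$ by its array of coefficients with respect to the basis $e_i \otimes f_j$, i.e. by the outer product of the two coordinate vectors; the helper names `tensor` and `tensor_inner` are hypothetical. It checks the defining identity of Definition 3.30, one of the bilinearity relations, and the dimension count.

```python
def inner(x, y):
    """Euclidean inner product on R^n (real case only)."""
    return sum(a * b for a, b in zip(x, y))


def tensor(h, k):
    """Coefficients of h (x) k in the basis e_i (x) f_j: the outer product."""
    return [[hi * kj for kj in k] for hi in h]


def tensor_inner(s, t):
    """Inner product on the tensor product space (entrywise sum of products)."""
    return sum(a * b for row_s, row_t in zip(s, t) for a, b in zip(row_s, row_t))


h, h2 = [1, 2], [3, -1]        # elements of R^2
k, k2 = [0, 1, 2], [4, 1, 1]   # elements of R^3

lhs = tensor_inner(tensor(h, k), tensor(h2, k2))
rhs = inner(h, h2) * inner(k, k2)
```

Both sides equal 3 here. Note that a general element of the tensor product is a sum of elementary tensors and need not itself be of the form $h \otimes k$; in this representation it is an arbitrary $2 \times 3$ array, so the space has dimension $6 = 2 \cdot 3$.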

Tensor products of Hilbert spaces arise very naturally when considering 
spaces of functions of more than one variable, or spaces of functions that 
take values in other function spaces. A prime example of the second type is 
a space of stochastic processes. 

Example 3.31. (a) Given two measure spaces $(X, \mathscr{F}, \mu)$ and $(Y, \mathscr{G}, \nu)$, consider
$L^2(X \times Y, \mu \otimes \nu; \mathbb{K})$, the space of functions on $X \times Y$ that are square
integrable with respect to the product measure $\mu \otimes \nu$. If $f \in L^2(X, \mu; \mathbb{K})$
and $g \in L^2(Y, \nu; \mathbb{K})$, then we can define a function $h\colon X \times Y \to \mathbb{K}$ by
$h(x, y) := f(x) g(y)$. The definition of the product measure ensures that
$h \in L^2(X \times Y, \mu \otimes \nu; \mathbb{K})$, so this procedure defines a bilinear mapping
$L^2(X, \mu; \mathbb{K}) \times L^2(Y, \nu; \mathbb{K}) \to L^2(X \times Y, \mu \otimes \nu; \mathbb{K})$. It turns out that the
span of the range of this bilinear map is dense in $L^2(X \times Y, \mu \otimes \nu; \mathbb{K})$ if
$L^2(X, \mu; \mathbb{K})$ and $L^2(Y, \nu; \mathbb{K})$ are separable. This shows that

$$ L^2(X, \mu; \mathbb{K}) \otimes L^2(Y, \nu; \mathbb{K}) \cong L^2(X \times Y, \mu \otimes \nu; \mathbb{K}), $$

and it also explains why it is necessary to take the completion in the
construction of the Hilbert space tensor product.

(b) Similarly, $L^2(X, \mu; \mathcal{H})$, the space of functions $f\colon X \to \mathcal{H}$ that are square
integrable in the sense that

$$ \int_X \|f(x)\|_{\mathcal{H}}^2 \, \mathrm{d}\mu(x) < +\infty, $$

is isomorphic to $L^2(X, \mu; \mathbb{K}) \otimes \mathcal{H}$ if this space is separable. The isomorphism
maps $f \otimes \varphi \in L^2(X, \mu; \mathbb{K}) \otimes \mathcal{H}$ to the $\mathcal{H}$-valued function $x \mapsto f(x) \varphi$
in $L^2(X, \mu; \mathcal{H})$.

(c) Combining the previous two examples reveals that

$$ L^2(X, \mu; \mathbb{K}) \otimes L^2(Y, \nu; \mathbb{K}) \cong L^2(X \times Y, \mu \otimes \nu; \mathbb{K}) \cong L^2\bigl(X, \mu; L^2(Y, \nu; \mathbb{K})\bigr). $$

Similarly, one can consider a Bochner space $L^p(X, \mu; V)$ of functions
(random variables) taking values in a Banach space $V$ that are $p$th-power-integrable
in the sense that $\int_X \|f(x)\|_V^p \, \mathrm{d}\mu(x)$ is finite, and identify this space
with a suitable tensor product $L^p(X, \mu; \mathbb{R}) \otimes V$. However, several subtleties
arise in doing this, as there is no single ‘natural’ Banach tensor product of
Banach spaces as there is for Hilbert spaces.


3.6 Bibliography 

Reference texts on elementary functional analysis, including Banach and 
Hilbert space theory, include the books of Reed and Simon (1972), Rudin 
(1991), and Rynne and Youngson (2008). The article of Deutsch (1982) gives 
a good overview of closest-point approximation properties for subspaces of 
Banach spaces. Further discussion of the relationship between tensor products
and spaces of vector-valued integrable functions can be found in the books of
Ryan (2002) and Hackbusch (2012); the former is essentially a pure mathematics
text, whereas the latter also includes significant treatment of numerical
and computational matters. The Sobolev embedding theorem (Theorem 3.12)
and its proof can be found in Evans (2010, Section 5.6, Theorem 6). 

Intrepid students may wish to consult Bourbaki (1987), but the standard 
warnings about Bourbaki texts apply: the presentation is comprehensive but 
often forbiddingly austere, and so it is perhaps better as a reference text than 
a learning tool. On the other hand, the Hitchhiker’s Guide of Aliprantis and 
Border (2006) is a surprisingly readable encyclopaedic text. 


3.7 Exercises 

Exercise 3.1 (Formulae for the operator norm). Let $A\colon V \to W$ be a linear
map between normed vector spaces $(V, \|\cdot\|_V)$ and $(W, \|\cdot\|_W)$. Show that the
operator norm $\|A\|_{V \to W}$ of $A$ is equivalently defined by any of the following
expressions:

Exercise 3.2 (Properties of the operator norm). Suppose that $U$, $V$, and $W$
are normed vector spaces, and let $A\colon U \to V$ and $B\colon V \to W$ be bounded
linear maps. Prove that the operator norm is

(a) compatible (or consistent) with $\|\cdot\|_U$ and $\|\cdot\|_V$: for all $u \in U$,

$$ \|A u\|_V \leq \|A\|_{U \to V} \|u\|_U; $$

(b) sub-multiplicative: $\|B \circ A\|_{U \to W} \leq \|B\|_{V \to W} \|A\|_{U \to V}$.

Exercise 3.3 (Definiteness of the Gram matrix). Let $V$ be a vector space
over $\mathbb{K}$, equipped with a semi-definite inner product $\langle \cdot, \cdot \rangle$ (i.e. one satisfying
all the requirements of Definition 3.2 except possibly positive definiteness).
Given vectors $v_1, \dots, v_n \in V$, the associated Gram matrix is

$$ G(v_1, \dots, v_n) := \begin{bmatrix} \langle v_1, v_1 \rangle & \cdots & \langle v_1, v_n \rangle \\ \vdots & \ddots & \vdots \\ \langle v_n, v_1 \rangle & \cdots & \langle v_n, v_n \rangle \end{bmatrix}. $$

(a) Show that, in the case that $V = \mathbb{K}^n$ with its usual inner product,
$G(v_1, \dots, v_n) = V^* V$, where $V$ is the matrix with the vectors $v_i$ as its
columns, and $V^*$ denotes the conjugate transpose of $V$.

(b) Show that $G(v_1, \dots, v_n)$ is a conjugate-symmetric (a.k.a. Hermitian) matrix,
and hence is symmetric in the case $\mathbb{K} = \mathbb{R}$.

(c) Show that $\det G(v_1, \dots, v_n) \geq 0$. Show also that $\det G(v_1, \dots, v_n) = 0$ if
$v_1, \dots, v_n$ are linearly dependent, and that this is an ‘if and only if’ if
$\langle \cdot, \cdot \rangle$ is positive definite.

(d) Using the case $n = 2$, prove the Cauchy-Schwarz inequality (3.1).

Exercise 3.4 (Closest-point approximation in Banach spaces). Let $R_\theta\colon \mathbb{R}^2 \to \mathbb{R}^2$
denote the linear map that is rotation of the Euclidean plane about the
origin through a fixed angle $\theta$. Define a Banach norm $\|\cdot\|_\theta$ on $\mathbb{R}^2$
in terms of $R_\theta$ and the usual 1-norm by

$$ \|(x, y)\|_\theta := \|R_\theta (x, y)\|_1. $$

Find the closest point of the $x$-axis to the point $(1, 1)$, i.e. find $x' \in \mathbb{R}$ to
minimize $\|(x', 0) - (1, 1)\|_\theta$; in particular, show that the closest point is not
$(1, 0)$. Hint: sketch some norm balls centred on $(1, 1)$.

Exercise 3.5 (Series in normed spaces). Many UQ methods involve series
expansions in spaces of deterministic functions and/or random variables, so it
is useful to understand when such series converge. Let $(v_n)_{n \in \mathbb{N}}$ be a sequence
in a normed space $V$. As in $\mathbb{R}$, we say that the series $\sum_{n \in \mathbb{N}} v_n$ converges to
$v \in V$ if the sequence of partial sums converges to $v$, i.e. if, for all $\varepsilon > 0$, there
exists $N_\varepsilon \in \mathbb{N}$ such that

$$ N \geq N_\varepsilon \implies \Bigl\| v - \sum_{n=1}^{N} v_n \Bigr\| < \varepsilon. $$

(a) Suppose that $\sum_{n \in \mathbb{N}} v_n$ converges absolutely to $v \in V$, i.e. the series converges
and also $\sum_{n \in \mathbb{N}} \|v_n\|$ is finite. Prove the infinite triangle inequality

$$ \|v\| \leq \sum_{n \in \mathbb{N}} \|v_n\|. $$

(b) Suppose that $\sum_{n \in \mathbb{N}} v_n$ converges absolutely to $v \in V$. Show that $\sum_{n \in \mathbb{N}} v_n$
converges unconditionally to $v \in V$, i.e. $\sum_{n \in \mathbb{N}} v_{\pi(n)}$ converges to $v \in V$
for every bijection $\pi\colon \mathbb{N} \to \mathbb{N}$. Thus, the order of summation ‘does not
matter’. (Note that the converse of this result is false: Dvoretzky and
Rogers (1950) showed that every infinite-dimensional Banach space contains
series that converge unconditionally but not absolutely.)

(c) Suppose that $V$ is a Banach space and that $\sum_{n \in \mathbb{N}} \|v_n\|$ is finite. Show
that $\sum_{n \in \mathbb{N}} v_n$ converges to some $v \in V$.

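Exercise 3.6 (Weierstrass M-test). Let $S$ be any set, let $V$ be a Banach
space, and, for each $n \in \mathbb{N}$, let $f_n\colon S \to V$. Suppose that $M_n$ is such that

$$ \|f_n(x)\| \leq M_n \quad \text{for all } x \in S \text{ and } n \in \mathbb{N}, $$

and that $\sum_{n \in \mathbb{N}} M_n$ is finite. Show that the series $\sum_{n \in \mathbb{N}} f_n$ converges uniformly
on $S$, i.e. there exists $f\colon S \to V$ such that, for all $\varepsilon > 0$, there exists
$N_\varepsilon \in \mathbb{N}$ so that

$$ N \geq N_\varepsilon \implies \sup_{x \in S} \Bigl\| f(x) - \sum_{n=1}^{N} f_n(x) \Bigr\| < \varepsilon. $$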


Chapter 4 

Optimization Theory 


We demand rigidly defined areas of doubt and 
uncertainty! 


The Hitchhiker’s Guide to the Galaxy 

Douglas Adams 


This chapter reviews the basic elements of optimization theory and practice, 
without going into the fine details of numerical implementation. Many UQ 
problems involve a notion of ‘best fit’, in the sense of minimizing some error 
function, and so it is helpful to establish some terminology for optimiza- 
tion problems. In particular, many of the optimization problems in this book 
will fall into the simple settings of linear programming and least squares 
(quadratic programming), with and without constraints. 


4.1 Optimization Problems and Terminology 

In an optimization problem, the objective is to find the extreme values (either
the minimal value, the maximal value, or both) $f(x)$ of a given function $f$
among all $x$ in a given subset of the domain of $f$, along with the point or
points $x$ that realize those extreme values. The general form of a constrained
optimization problem is

    extremize:        $f(x)$
    with respect to:  $x \in X$
    subject to:       $g_i(x) \in E_i$ for $i = 1, 2, \dots$,

where $X$ is some set; $f\colon X \to \mathbb{R} \cup \{\pm\infty\}$ is a function called the objective
function; and, for each $i$, $g_i\colon X \to Y_i$ is a function and $E_i \subseteq Y_i$ some subset.


(c) Springer International Publishing Switzerland 2015 

T.J. Sullivan, Introduction to Uncertainty Quantification , Texts 

in Applied Mathematics 63, DOI 10.1007/978-3-319-23395-6-4 

The conditions $\{g_i(x) \in E_i \mid i = 1, 2, \dots\}$ are called constraints, and a point
$x \in X$ for which all the constraints are satisfied is called feasible; the set of
feasible points,

$$ \{ x \in X \mid g_i(x) \in E_i \text{ for } i = 1, 2, \dots \}, $$

is called the feasible set. If there are no constraints, so that the problem is
a search over all of $X$, then the problem is said to be unconstrained. In the
case of a minimization problem, the objective function $f$ is also called the
cost function or energy; for maximization problems, the objective function is
also called the utility function.

From a purely mathematical point of view, the distinction between con- 
strained and unconstrained optimization is artificial: constrained minimiza- 
tion over X is the same as unconstrained minimization over the feasible set. 
However, from a practical standpoint, the difference is huge. Typically, $X$ is
$\mathbb{R}^n$ for some $n$, or perhaps a simple subset specified using inequalities on one
coordinate at a time, such as $[a_1, b_1] \times \cdots \times [a_n, b_n]$; a bona fide non-trivial
constraint is one that involves a more complicated function of one coordinate,
or two or more coordinates, such as

$$ g_1(x) := \cos(x) - \sin(x) > 0 $$

or

$$ g_2(x_1, x_2, x_3) := x_1 x_2 - x_3 = 0. $$


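Definition 4.1. Given $f\colon X \to \mathbb{R} \cup \{\pm\infty\}$, the arg min or set of global
minimizers of $f$ is defined to be

$$ \operatorname*{arg\,min}_{x \in X} f(x) := \Bigl\{ x \in X \Bigm| f(x) = \inf_{x' \in X} f(x') \Bigr\}, $$

and the arg max or set of global maximizers of $f$ is defined to be

$$ \operatorname*{arg\,max}_{x \in X} f(x) := \Bigl\{ x \in X \Bigm| f(x) = \sup_{x' \in X} f(x') \Bigr\}. $$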
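
The point that arg min and arg max are sets, possibly with several elements, is easy to see on a finite search space. The sketch below is only an illustration: it assumes a finite candidate set, for which the infimum is attained and can be found by enumeration, and the helper name `arg_min` and the test function are hypothetical.

```python
def arg_min(f, candidates):
    """Set of global minimizers of f over a finite candidate set."""
    values = {x: f(x) for x in candidates}
    best = min(values.values())
    return {x for x, v in values.items() if v == best}


def f(x):
    """A function with two global minimizers on the integers."""
    return (x * x - 1) ** 2


grid = range(-3, 4)
minimizers = arg_min(f, grid)
maximizers = arg_min(lambda x: -f(x), grid)  # arg max f = arg min (-f)
```

Over the integers from $-3$ to $3$, the minimizers are $\pm 1$ and the maximizers are $\pm 3$. Over an infinite set the infimum need not be attained, in which case the arg min is empty.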


Definition 4.2. For a given constrained or unconstrained optimization prob- 
lem, a constraint is said to be 

(a) redundant if it does not change the feasible set, and non-redundant or 
relevant otherwise; 

(b) non-binding if it does not change the extreme value, and binding other- 
wise; 

(c) active if it is an inequality constraint that holds as an equality at the 
extremizer, and inactive otherwise. 

Example 4.3. Consider $f\colon \mathbb{R}^2 \to \mathbb{R}$, $f(x, y) := y$. Suppose that we wish to
minimize $f$ over the unbounded w-shaped region

$$ W := \{ (x, y) \in \mathbb{R}^2 \mid y \geq (x^2 - 1)^2 \}. $$

Over $W$, $f$ takes the minimum value 0 at $(x, y) = (\pm 1, 0)$. Note that the
inequality constraint $y \geq (x^2 - 1)^2$ is an active constraint. The additional
constraint $y \geq 0$ would be redundant with respect to this feasible set $W$,
and hence also non-binding. The additional constraint $x > 0$ would be
non-redundant, but also non-binding, since it excludes the previous minimizer at
$(x, y) = (-1, 0)$ but not the one at $(x, y) = (1, 0)$. Similarly, the additional
equality constraint $y = (x^2 - 1)^2$ would be non-redundant and non-binding.

The importance of these concepts for UQ lies in the fact that many UQ 
problems are, in part or in whole, optimization problems: a good example 
is the calibration of parameters in a model in order to best explain some 
observed data. Each piece of information about the problem (e.g. a hypoth- 
esis about the form of the model, such as a physical law) can be seen as 
a constraint on that optimization problem. It is easy to imagine that each 
additional constraint may introduce additional difficulties in computing the 
parameters of best fit. Therefore, it is natural to want to exclude from consid- 
eration those constraints (pieces of information) that are merely complicating 
the solution process, and not actually determining the optimal parameters, 
and to have some terminology for describing the various ways in which this 
can occur. 


4.2 Unconstrained Global Optimization 

In general, finding a global minimizer of an arbitrary function is very hard,
especially in high-dimensional settings and without nice features like convexity.
Except in very simple settings like linear least squares (Section 4.6), it is
necessary to construct an approximate solution, and to do so iteratively; that
is, one computes a sequence $(x_n)_{n \in \mathbb{N}}$ in $X$ such that $x_n$ converges as $n \to \infty$
to an extremizer of the objective function within the feasible set. A simple
example of a deterministic iterative method for finding the critical points,
and hence extrema, of a smooth function is Newton’s method:

Definition 4.4. Let $X$ be a normed vector space. Given a differentiable
function $g\colon X \to X$ and an initial state $x_0$, Newton’s method for finding a
zero of $g$ is the sequence generated by the iteration

$$ x_{n+1} := x_n - \bigl( \mathrm{D}g(x_n) \bigr)^{-1} g(x_n), \qquad (4.1) $$

where $\mathrm{D}g(x_n)\colon X \to X$ is the Fréchet derivative of $g$ at $x_n$. Newton’s method
is often applied to find critical points of $f\colon X \to \mathbb{R}$, i.e. points where $\mathrm{D}f$
vanishes, in which case the iteration is

$$ x_{n+1} := x_n - \bigl( \mathrm{D}^2 f(x_n) \bigr)^{-1} \mathrm{D}f(x_n). \qquad (4.2) $$

(In (4.2), the second derivative (Hessian) $\mathrm{D}^2 f(x_n)$ is interpreted as a linear
map $X \to X$ rather than a bilinear map $X \times X \to \mathbb{R}$.)
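
In one dimension, iteration (4.2) reduces to dividing the first derivative by the second. The following sketch assumes $X = \mathbb{R}$, exact derivatives supplied by the caller, and a fixed number of steps with no convergence test or safeguard against a vanishing second derivative; the function name `newton_critical_point` and both test functions are hypothetical.

```python
def newton_critical_point(df, d2f, x0, n_steps=20):
    """Iterate x_{n+1} = x_n - df(x_n) / d2f(x_n), the scalar case of (4.2)."""
    x = x0
    for _ in range(n_steps):
        x = x - df(x) / d2f(x)
    return x


# f(x) = x^4 / 4 - x has derivative x^3 - 1 and a critical point at x = 1.
x_crit = newton_critical_point(lambda x: x**3 - 1.0, lambda x: 3.0 * x**2, x0=2.0)

# f(x) = 3 x^2 - 12 x + 5 is quadratic with minimizer x = 2; a single
# Newton step from any starting point lands on it.
x_quad = newton_critical_point(lambda x: 6.0 * x - 12.0, lambda x: 6.0, x0=10.0, n_steps=1)
```

The iteration finds critical points, not necessarily minimizers, and its behaviour depends on the starting point; practical implementations usually add step-size control and stopping criteria.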

Remark 4.5. (a) Newton’s method for the determination of critical points
of $f$ amounts to local quadratic approximation: we model $f$ about $x_n$
using its Taylor expansion up to second order, and then take as $x_{n+1}$
a critical point of this quadratic approximation. In particular, as shown
in Exercise 4.3, Newton’s method yields the exact minimizer of $f$ in one
iteration when $f$ is in fact a quadratic function.

(b) We will not dwell at this point on the important practical issue of numerical
(and hence approximate) evaluation of derivatives for methods such
as Newton iteration. However, this issue will be revisited in Section 10.2
in the context of sensitivity analysis.

For objective functions $f\colon X \to \mathbb{R} \cup \{\pm\infty\}$ that have little to no smoothness,
or that have many local extremizers, it is often necessary to resort
to random searches of the space $X$. For such algorithms, there can only be
a probabilistic guarantee of convergence. The rate of convergence and the
degree of approximate optimality naturally depend upon features like randomness
of the generation of new elements of $X$ and whether the extremizers
of $f$ are difficult to reach, e.g. because they are located in narrow ‘valleys’. We
now describe three very simple random iterative algorithms for minimization
of a prescribed objective function f, in order to illustrate some of the relevant
issues. For simplicity, suppose that f has a unique global minimizer x_min
and write f_min for f(x_min).

Algorithm 4.6 (Random sampling). For simplicity, the following algorithm
runs for n_max steps with no convergence checks. The algorithm returns
an approximate minimizer x_best along with the corresponding value of f.
Suppose that random() generates independent samples of $X$ from a probability
measure $\mu$ with support $X$.

    f_best = float("inf")
    n = 0
    while n < n_max:
        x_new = random()
        f_new = f(x_new)
        if f_new < f_best:
            # keep the best sample seen so far
            x_best = x_new
            f_best = f_new
        n = n + 1
    return [x_best, f_best]
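
A self-contained version of Algorithm 4.6 might look as follows. This is a sketch under stated assumptions rather than a reference implementation: the search space is taken to be the interval $[-2, 2]$ with uniform sampling, the double-well objective is a hypothetical test function with a unique global minimizer near $x = -1$, and the seed is fixed only to make runs repeatable.

```python
import random


def random_sampling(f, sample, n_max):
    """Algorithm 4.6: return the best of n_max independent samples."""
    x_best, f_best = None, float("inf")
    for _ in range(n_max):
        x_new = sample()
        f_new = f(x_new)
        if f_new < f_best:
            x_best, f_best = x_new, f_new
    return x_best, f_best


def f(x):
    """Hypothetical objective: double well, global minimum near x = -1."""
    return (x * x - 1.0) ** 2 + 0.1 * x


rng = random.Random(0)                       # fixed seed, for repeatability only


def sample():
    """Uniform sample from the assumed search space [-2, 2]."""
    return rng.uniform(-2.0, 2.0)


x_best, f_best = random_sampling(f, sample, 10_000)
```

Every sample is drawn without reference to earlier ones, so the accuracy of `x_best` improves only as fast as the sampling happens to land near the minimizer.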

A weakness of Algorithm 4.6 is that it completely neglects local information
about f. Even if the current state x_best is very close to the global
minimizer x_min, the algorithm may continue to sample points x_new that
are very far away and have f(x_new) ≫ f(x_best). It would be preferable to
explore a neighbourhood of x_best more thoroughly and hence find a better
approximation of [x_min, f_min]. The next algorithm attempts to rectify
this deficiency.
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Algorithm 4.7 (Random walk). As before, this algorithm runs for n_max 
steps. The algorithm returns an approximate minimizer x_best along with 
the corresponding value of f . Suppose that an initial state xO is given, and 
that jump() generates independent samples of A from a probability measure 
/I with support equal to the unit ball of A. 

x_best = x0
f_best = f(x_best)
n = 0
while n < n_max:
    x_new = x_best + jump()
    f_new = f(x_new)
    if f_new < f_best:
        x_best = x_new
        f_best = f_new
    n = n + 1
return [x_best, f_best]
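
A runnable one-dimensional sketch of Algorithm 4.7 follows. The function name random_walk, the uniform jump distribution on [-1, 1] and both test objectives are assumptions chosen for illustration. The second objective has a local minimum at 0 whose well is wider than the unit jump radius, so a walk started there can never move.

```python
import random

def random_walk(f, x0, jump, n_max):
    """Only ever move to a strictly better state within jump range."""
    x_best, f_best = x0, f(x0)
    for _ in range(n_max):
        x_new = x_best + jump()
        f_new = f(x_new)
        if f_new < f_best:
            x_best, f_best = x_new, f_new
    return x_best, f_best

rng = random.Random(1)
jump = lambda: rng.uniform(-1.0, 1.0)   # support is the unit ball of R

# Convex objective: the walk homes in on the minimizer x = 3.
convex = lambda x: (x - 3.0) ** 2
x_a, f_a = random_walk(convex, 0.0, jump, 5_000)

# Two wells: local minimum at 0 (value 0), global minimum at 4 (value -1).
# Every point within distance 1 of 0 has a value >= 0, so the walk is stuck.
two_wells = lambda x: min(x ** 2, (x - 4.0) ** 2 - 1.0)
x_b, f_b = random_walk(two_wells, 0.0, jump, 5_000)
```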

Algorithm 4.7 also has a weakness: since the state is only ever updated to 
states with a strictly lower value of f , and only looks for new states within 
unit distance of the current one, the algorithm is prone to becoming stuck in 
local minima if they are surrounded by wells that are sufficiently wide, even 
if they are very shallow. The next algorithm, the simulated annealing method 
of Kirkpatrick et al. (1983), attempts to rectify this problem by allowing the 
optimizer to make some ‘uphill’ moves, which can be accepted or rejected 
according to comparison of a uniformly distributed random variable with a 
user-prescribed acceptance probability function. Therefore, in the simulated 
annealing algorithm, a distinction is made between the current state x of 
the algorithm and the best state so far, x_best; unlike in the previous two 
algorithms, proposed states x_new may be accepted and become x even if 
f(x_new) > f(x_best). The idea is to introduce a parameter T, to be thought
of as ‘temperature’: the optimizer starts off ‘hot’, and ‘uphill’ moves are likely
to be accepted; by the end of the calculation, the optimizer is relatively ‘cold’,
and ‘uphill’ moves are unlikely to be accepted.

Algorithm 4.8 (Simulated annealing). Suppose that an initial state x0
is given. Suppose also that functions temperature(), neighbour() and
acceptance_prob() have been specified. Suppose that uniform() generates
independent samples from the uniform distribution on [0, 1]. Then the
simulated annealing algorithm is

x = x0
fx = f(x)
x_best = x
f_best = fx
n = 0
while n < n_max:
    T = temperature(n / n_max)
    x_new = neighbour(x)
    f_new = f(x_new)
    if acceptance_prob(fx, f_new, T) > uniform():
        x = x_new
        fx = f_new
    if f_new < f_best:
        x_best = x_new
        f_best = f_new
    n = n + 1
return [x_best, f_best]


Like Algorithm 4.6, the simulated annealing method can be guaranteed to
find the global minimizer of f provided that the neighbour() function
allows full exploration of the state space and the maximum run time n_max
is large enough. However, the difficulty lies in coming up with functions
temperature() and acceptance_prob() such that the algorithm finds the
global minimizer in reasonable time: simulated annealing calculations can
be extremely computationally costly. A commonly used acceptance
probability function P is the one from the Metropolis-Hastings algorithm
(see also Section 9.5):

    $$P(e, e', T) := \begin{cases} 1, & \text{if } e' \le e, \\ \exp(-(e' - e)/T), & \text{if } e' > e. \end{cases}$$

There are, however, many other choices; in particular, it is not necessary
to automatically accept downhill moves, and it is permissible to have
P(e, e', T) < 1 for e' < e.
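
As a hedged illustration, here is a runnable Python sketch of simulated annealing with the Metropolis-type acceptance probability. The linear cooling schedule, the uniform neighbour proposal, the two-well objective and the parameter values are all assumptions made for this example; the tiny offset 1e-3 in the schedule only keeps T strictly positive.

```python
import math
import random

def acceptance_prob(e, e_new, T):
    """Always accept downhill moves; accept uphill moves with prob exp(-dE/T)."""
    if e_new <= e:
        return 1.0
    return math.exp(-(e_new - e) / T)

def simulated_annealing(f, x0, neighbour, temperature, n_max, uniform):
    x, fx = x0, f(x0)
    x_best, f_best = x, fx
    for n in range(n_max):
        T = temperature(n / n_max)
        x_new = neighbour(x)
        f_new = f(x_new)
        # The current state may move uphill ...
        if acceptance_prob(fx, f_new, T) > uniform():
            x, fx = x_new, f_new
        # ... but the best state so far is tracked separately.
        if f_new < f_best:
            x_best, f_best = x_new, f_new
    return x_best, f_best

rng = random.Random(0)
# Local minimum at 0 (value 0), global minimum at 4 (value -1).
f = lambda x: min(x ** 2, (x - 4.0) ** 2 - 1.0)
x_best, f_best = simulated_annealing(
    f,
    x0=0.0,
    neighbour=lambda x: x + rng.uniform(-1.0, 1.0),
    temperature=lambda s: 1.0 - s + 1e-3,   # hot at s = 0, cold near s = 1
    n_max=20_000,
    uniform=rng.random,
)
```

Starting from the local minimizer 0, a purely downhill random walk with the same proposal could not leave the shallow well, whereas the uphill moves accepted while the system is hot allow the annealer to cross the barrier.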


4.3 Constrained Optimization 

It is well known that the unconstrained extremizers of smooth enough func- 
tions must be critical points, i.e. points where the derivative vanishes. The fol- 
lowing theorem, the Lagrange multiplier theorem, states that the constrained 
minimizers of a smooth enough function, subject to smooth enough equality 
constraints, are critical points of an appropriately generalized function: 

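Theorem 4.9 (Lagrange multipliers). Let $\mathcal{X}$ and $\mathcal{Y}$ be real Banach spaces.
Let $U \subseteq \mathcal{X}$ be open and let $f \in C^1(U; \mathbb{R})$. Let $g \in C^1(U; \mathcal{Y})$, and suppose that
$x \in U$ is a constrained extremizer of $f$ subject to the constraint that $g(x) = 0$.
Suppose also that the Fréchet derivative $\mathrm{D}g(x) \colon \mathcal{X} \to \mathcal{Y}$ is surjective. Then
there exists a Lagrange multiplier $\lambda \in \mathcal{Y}'$ such that $(x, \lambda)$ is an unconstrained
critical point of the Lagrangian $\mathcal{L}$ defined by

    $$U \times \mathcal{Y}' \ni (x, \lambda) \mapsto \mathcal{L}(x, \lambda) := f(x) + \langle \lambda \mid g(x) \rangle \in \mathbb{R},$$

i.e. $\mathrm{D}f(x) = -\lambda \circ \mathrm{D}g(x)$ as linear maps from $\mathcal{X}$ to $\mathbb{R}$.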
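
As a small worked illustration (the particular $f$ and $g$ are chosen here for the example and are not from the text), consider extremizing $f(x, y) := x + y$ on the unit circle, i.e. subject to $g(x, y) := x^2 + y^2 - 1 = 0$. The derivative $\mathrm{D}g(x, y) = (2x, 2y)$ is non-zero, hence surjective onto $\mathbb{R}$, at every point of the circle, so the theorem applies.

```latex
\begin{align*}
  \mathcal{L}(x, y, \lambda) &= x + y + \lambda (x^2 + y^2 - 1), \\
  \mathrm{D}f = -\lambda \, \mathrm{D}g
    &\iff (1, 1) = -\lambda (2x, 2y)
    \iff x = y = -\frac{1}{2\lambda}, \\
  g(x, y) = 0
    &\iff \frac{2}{4\lambda^2} = 1
    \iff \lambda = \pm \frac{1}{\sqrt{2}}.
\end{align*}
```

Thus $\lambda = 1/\sqrt{2}$ gives the constrained minimizer $(x, y) = (-1/\sqrt{2}, -1/\sqrt{2})$ with $f = -\sqrt{2}$, and $\lambda = -1/\sqrt{2}$ gives the constrained maximizer $(1/\sqrt{2}, 1/\sqrt{2})$ with $f = \sqrt{2}$.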

The corresponding result for inequality constraints is the Karush-Kuhn-Tucker
theorem, which we state here for a finite system of inequality
constraints:

Theorem 4.10 (Karush-Kuhn-Tucker). Let $U$ be an open subset of a
Banach space $\mathcal{X}$, and let $f \in C^1(U; \mathbb{R})$ and $h \in C^1(U; \mathbb{R}^m)$. Suppose that
$x \in U$ is a local minimizer of $f$ subject to the inequality constraints $h_i(x) \le 0$
for $i = 1, \dots, m$, and suppose that $\mathrm{D}h(x) \colon \mathcal{X} \to \mathbb{R}^m$ is surjective. Then there
exists $\mu = (\mu_1, \dots, \mu_m) \in (\mathbb{R}^m)'$ such that

    $$-\mathrm{D}f(x) = \mu \circ \mathrm{D}h(x),$$

where $\mu$ satisfies the dual feasibility criteria $\mu_i \ge 0$ and the complementary
slackness criteria $\mu_i h_i(x) = 0$ for $i = 1, \dots, m$.

The Lagrange and Karush-Kuhn-Tucker theorems can be combined to
incorporate equality constraints $g_i$ and inequality constraints $h_j$. Strictly
speaking, the validity of the Karush-Kuhn-Tucker theorem also depends upon
some regularity conditions on the constraints called constraint qualification
conditions, of which there are many variations that can easily be found in the
literature. A very simple one is that if $g_i$ and $h_j$ are affine functions, then no
further regularity is needed; another is that the gradients of the active
inequality constraints and the gradients of the equality constraints be linearly
independent at the optimal point $x$.

Numerical Implementation of Constraints. In the numerical treatment 
of constrained optimization problems, there are many ways to implement 
constraints, not all of which actually enforce the constraints in the sense of 
ensuring that trial states x_new, accepted states x, or even the final solution 
x_best are actually members of the feasible set. For definiteness, consider 
the constrained minimization problem 

    minimize: $f(x)$
    with respect to: $x \in \mathcal{X}$
    subject to: $c(x) \le 0$

for some functions $f, c \colon \mathcal{X} \to \mathbb{R} \cup \{\pm\infty\}$. One way of seeing the constraint
‘$c(x) \le 0$’ is as a Boolean true/false condition: either the inequality is
satisfied, or it is not. Supposing that neighbour(x) generates new (possibly
infeasible) elements of $\mathcal{X}$ given a current state x, one approach to generating
feasible trial states x_new is the following:

x' = neighbour(x)
while c(x') > 0:
    x' = neighbour(x)
x_new = x'

However, this accept/reject approach is extremely wasteful: if the feasible
set is very small, then x' will ‘usually’ be rejected, thereby wasting a lot
of computational time, and this approach takes no account of how ‘nearly
feasible’ an infeasible x' might be.

One alternative approach is to use penalty functions: instead of considering
the constrained problem of minimizing $f(x)$ subject to $c(x) \le 0$, one can
consider the unconstrained problem of minimizing $x \mapsto f(x) + p(x)$, where
$p \colon \mathcal{X} \to [0, \infty)$ is some function that equals zero on the feasible set and takes
larger values the ‘more’ the constraint inequality $c(x) \le 0$ is violated, e.g.,
for $\mu > 0$,

    $$p_\mu(x) = \begin{cases} 0, & \text{if } c(x) \le 0, \\ \exp(c(x)/\mu) - 1, & \text{if } c(x) > 0. \end{cases}$$

The hope is that (a) the minimization of $f + p_\mu$ over all of $\mathcal{X}$ is easy, and (b)
as $\mu \to 0$, minimizers of $f + p_\mu$ converge to minimizers of $f$ on the original
feasible set. The penalty function approach is attractive, but the choice of
penalty function is rather ad hoc, and issues can easily arise of competition
between the penalties corresponding to multiple constraints.
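
The behaviour of an exponential penalty as its parameter shrinks can be seen in a one-dimensional sketch. The objective (x - 2)^2, the constraint x - 1 <= 0, the helper minimize_1d (a plain ternary search, valid here because the penalized objective is convex) and the list of penalty parameters are all assumptions made for this illustration.

```python
import math

def penalty(c_value, mu):
    """Exponential penalty: zero when feasible, growing with the violation."""
    return 0.0 if c_value <= 0.0 else math.exp(c_value / mu) - 1.0

def minimize_1d(g, a, b, tol=1e-9):
    """Ternary search for the minimizer of a convex function g on [a, b]."""
    while b - a > tol:
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if g(m1) < g(m2):
            b = m2
        else:
            a = m1
    return 0.5 * (a + b)

f = lambda x: (x - 2.0) ** 2   # unconstrained minimizer x = 2 is infeasible
c = lambda x: x - 1.0          # feasible set: x <= 1, constrained minimizer x = 1

minimizers = {}
for mu in (2.0, 1.0, 0.5, 0.1):
    g = lambda x, mu=mu: f(x) + penalty(c(x), mu)
    minimizers[mu] = minimize_1d(g, -3.0, 3.0)
```

For large penalty parameters the penalized minimizer is infeasible; as the parameter decreases it moves to the constrained minimizer x = 1.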

An alternative to the use of penalty functions is to construct constraining
functions that enforce the constraints exactly. That is, we seek a function C()
that takes as input a possibly infeasible x' and returns some x_new = C(x')
that is guaranteed to satisfy the constraint c(x_new) <= 0. For example,
suppose that $\mathcal{X} = \mathbb{R}^n$ and the feasible set is the Euclidean unit ball, so the
constraint is

    $$c(x) := \|x\|_2^2 - 1 \le 0.$$

Then a suitable constraining function could be

    $$C(x) := \begin{cases} x, & \text{if } \|x\|_2 \le 1, \\ x / \|x\|_2, & \text{if } \|x\|_2 > 1. \end{cases}$$

Constraining functions are very attractive because the constraints are treated
exactly. However, they must often be designed on a case-by-case basis for each
constraint function c, and care must be taken to ensure that multiple
constraining functions interact well and do not unduly favour parts of the feasible
set over others; for example, the above constraining function C maps the
entire infeasible set to the unit sphere, which might be considered undesirable
in certain settings, and so a function such as

    $$C(x) := \begin{cases} x, & \text{if } \|x\|_2 \le 1, \\ x / \|x\|_2^2, & \text{if } \|x\|_2 > 1, \end{cases}$$

might be more appropriate. Finally, note that the original accept/reject
method of finding feasible states is a constraining function in this sense,
albeit a very inefficient one.
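
Two constraining functions for the Euclidean unit ball can be sketched in Python as follows. The function names are hypothetical, and the second map is assumed here to be inversion through the unit sphere, x -> x / ||x||^2, which sends infeasible points into the interior of the ball rather than onto its boundary.

```python
import math

def norm2(x):
    """Euclidean norm of a vector given as a sequence of floats."""
    return math.sqrt(sum(xi * xi for xi in x))

def project_to_sphere(x):
    """Leave feasible points alone; send infeasible points to the unit sphere."""
    r = norm2(x)
    return list(x) if r <= 1.0 else [xi / r for xi in x]

def invert_through_sphere(x):
    """Leave feasible points alone; send infeasible points inside the ball."""
    r = norm2(x)
    return list(x) if r <= 1.0 else [xi / r ** 2 for xi in x]

x_infeasible = [3.0, 4.0]                        # norm 5, outside the unit ball
on_sphere = project_to_sphere(x_infeasible)      # norm 1
inside = invert_through_sphere(x_infeasible)     # norm 1/5
```

Both maps return feasible points, but the first concentrates every infeasible proposal on the boundary, whereas the second spreads them over the interior.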


4.4 Convex Optimization

The topic of this section is convex optimization. As will be seen, convexity is
a powerful property that makes optimization problems tractable to a much
greater extent than any amount of smoothness (which still permits local
minima) or low-dimensionality can do.

In this section, $\mathcal{X}$ will be a normed vector space. (More generally, the
properties that are of importance to the discussion hold for any Hausdorff,
locally convex topological vector space.) Given two points $x_0$ and $x_1$ of $\mathcal{X}$
and $t \in [0, 1]$, $x_t$ will denote the convex combination

    $$x_t := (1 - t) x_0 + t x_1.$$

More generally, given points $x_0, \dots, x_n$ of a vector space, a sum of the form

    $$\alpha_0 x_0 + \dots + \alpha_n x_n$$

is called a linear combination if the $\alpha_i$ are any field elements, an affine
combination if their sum is 1, and a convex combination if they are non-negative
and sum to 1.

Definition 4.11. (a) A subset $K \subseteq \mathcal{X}$ is a convex set if, for all $x_0, x_1 \in K$
and $t \in [0, 1]$, $x_t \in K$; it is said to be strictly convex if $x_t \in \mathring{K}$ (the
interior of $K$) whenever $x_0$ and $x_1$ are distinct points of $\overline{K}$ (the closure
of $K$) and $t \in (0, 1)$.

(b) An extreme point of a convex set $K$ is a point of $K$ that cannot be written
as a non-trivial convex combination of distinct elements of $K$; the set of
all extreme points of $K$ is denoted $\mathrm{ext}(K)$.

(c) The convex hull $\mathrm{co}(S)$ (resp. closed convex hull $\overline{\mathrm{co}}(S)$) of $S \subseteq \mathcal{X}$ is defined
to be the intersection of all convex (resp. closed and convex) subsets of
$\mathcal{X}$ that contain $S$.

Example 4.12. (a) The square $[-1, 1]^2$ is a convex subset of $\mathbb{R}^2$, but is not
strictly convex, and its extreme points are the four vertices $(\pm 1, \pm 1)$.

(b) The closed unit disc $\{(x, y) \in \mathbb{R}^2 \mid x^2 + y^2 \le 1\}$ is a strictly convex
subset of $\mathbb{R}^2$, and its extreme points are the points of the unit circle
$\{(x, y) \in \mathbb{R}^2 \mid x^2 + y^2 = 1\}$.

(c) If $p_0, \dots, p_d \in \mathcal{X}$ are distinct points such that $p_1 - p_0, \dots, p_d - p_0$
are linearly independent, then their (closed) convex hull is called a
$d$-dimensional simplex. The points $p_0, \dots, p_d$ are the extreme points of
the simplex.

(d) See Figure 4.1 for further examples.

Example 4.13. $\mathcal{M}_1(\mathcal{X})$ is a convex subset of the space of all (signed) Borel
measures on $\mathcal{X}$. The extremal probability measures are the zero-one
measures, i.e. those for which, for every measurable set $E \subseteq \mathcal{X}$, $\mu(E) \in \{0, 1\}$.
Furthermore, as will be discussed in Chapter 14, if $\mathcal{X}$ is, say, a Polish space,
then the zero-one measures (and hence the extremal probability measures)
on $\mathcal{X}$ are the Dirac point masses. Indeed, in this situation,

    $$\mathcal{M}_1(\mathcal{X}) = \overline{\mathrm{co}}(\{\delta_x \mid x \in \mathcal{X}\}) \subseteq \mathcal{M}_\pm(\mathcal{X}).$$

Fig. 4.1: Convex sets, extreme points and convex hulls of some subsets of the
plane $\mathbb{R}^2$. (a) A convex set (grey) and its set of extreme points (black).
(b) A non-convex set (black) and its convex hull (grey).

The principal reason to confine attention to normed spaces¹ $\mathcal{X}$ is that it
is highly inconvenient to have to work with spaces for which the following
‘common sense’ results do not hold:

Theorem 4.14 (Kreĭn-Milman). Let $K \subseteq \mathcal{X}$ be compact and convex. Then
$K$ is the closed convex hull of its extreme points.

Theorem 4.15 (Choquet-Bishop-de Leeuw). Let $K \subseteq \mathcal{X}$ be compact and
convex, and let $c \in K$. Then there exists a probability measure $p$ supported
on $\mathrm{ext}(K)$ such that, for all affine functions $f$ on $K$,

    $$f(c) = \int_{\mathrm{ext}(K)} f(e) \, \mathrm{d}p(e).$$

The point $c$ in Theorem 4.15 is called a barycentre of the set $K$, and the
probability measure $p$ is said to represent the point $c$. Informally speaking, the
Kreĭn-Milman and Choquet-Bishop-de Leeuw theorems together ensure that
a compact, convex subset $K$ of a topologically respectable space is entirely
characterized by its set of extreme points in the following sense: every point
of $K$ can be obtained as an average of extremal points of $K$, and, indeed, the
value of any affine function at any point of $K$ can be obtained as an average
of its values at the extremal points in the same way.


¹ Or, more generally, Hausdorff, locally convex, topological vector spaces.

Definition 4.16. Let $K \subseteq \mathcal{X}$ be convex. A function $f \colon K \to \mathbb{R} \cup \{\pm\infty\}$ is
a convex function if, for all $x_0, x_1 \in K$ and $t \in [0, 1]$,

    $$f(x_t) \le (1 - t) f(x_0) + t f(x_1), \tag{4.3}$$

and is called a strictly convex function if, for all distinct $x_0, x_1 \in K$ and
$t \in (0, 1)$,

    $$f(x_t) < (1 - t) f(x_0) + t f(x_1).$$

The inequality (4.3) defining convexity can be seen as a special case,
with $X \sim \mu$ supported on two points $x_0$ and $x_1$, of the following result:

Theorem 4.17 (Jensen). Let $(\Theta, \mathscr{F}, \mu)$ be a probability space, let $K \subseteq \mathcal{X}$
and $f \colon K \to \mathbb{R} \cup \{\pm\infty\}$ be convex, and let $X \in L^1(\Theta, \mu; \mathcal{X})$ take values in
$K$. Then

    $$f(\mathbb{E}_\mu[X]) \le \mathbb{E}_\mu[f(X)], \tag{4.4}$$

where $\mathbb{E}_\mu[X] \in \mathcal{X}$ is defined by the relation $\langle \ell \mid \mathbb{E}_\mu[X] \rangle = \mathbb{E}_\mu[\langle \ell \mid X \rangle]$ for every
$\ell \in \mathcal{X}'$. Furthermore, if $f$ is strictly convex, then equality holds in (4.4) if
and only if $X$ is $\mu$-almost surely constant.

It is straightforward to see that $f \colon K \to \mathbb{R} \cup \{\pm\infty\}$ is convex (resp. strictly
convex) if and only if its epigraph

    $$\mathrm{epi}(f) := \{(x, v) \in K \times \mathbb{R} \mid v \ge f(x)\}$$

is a convex (resp. strictly convex) subset of $K \times \mathbb{R}$. Furthermore,
twice-differentiable convex functions are easily characterized in terms of their
second derivative (Hessian):

Theorem 4.18. Let $f \colon K \to \mathbb{R}$ be twice continuously differentiable on an
open, convex set $K$. Then $f$ is convex if and only if $\mathrm{D}^2 f(x)$ is positive
semi-definite for all $x \in K$. If $\mathrm{D}^2 f(x)$ is positive definite for all $x \in K$, then $f$ is
strictly convex, though the converse is false.

Convex functions have many convenient properties with respect to mini- 
mization and maximization: 

Theorem 4.19. Let $f \colon K \to \mathbb{R}$ be a convex function on a convex set $K \subseteq \mathcal{X}$.
Then

(a) any local minimizer of $f$ in $K$ is also a global minimizer;

(b) the set $\arg\min_K f$ of global minimizers of $f$ in $K$ is convex;

(c) if $f$ is strictly convex, then it has at most one global minimizer in $K$;

(d) $f$ has the same maximum values on $K$ and $\mathrm{ext}(K)$.

Proof. (a) Suppose that $x_0$ is a local minimizer of $f$ in $K$ that is not a
global minimizer: that is, suppose that $x_0$ is a minimizer of $f$ in some
open neighbourhood $N$ of $x_0$, and also that there exists $x_1 \in K \setminus N$
such that $f(x_1) < f(x_0)$. Then, for sufficiently small $t > 0$, $x_t \in N$, but
convexity implies that

    $$f(x_t) \le (1 - t) f(x_0) + t f(x_1) < (1 - t) f(x_0) + t f(x_0) = f(x_0),$$

which contradicts the assumption that $x_0$ is a minimizer of $f$ in $N$.

(b) Suppose that $x_0, x_1 \in K$ are global minimizers of $f$. Then, for all
$t \in [0, 1]$, $x_t \in K$ and

    $$f(x_0) \le f(x_t) \le (1 - t) f(x_0) + t f(x_1) = f(x_0).$$

Hence, $x_t \in \arg\min_K f$, and so $\arg\min_K f$ is convex.

(c) Suppose that $x_0, x_1 \in K$ are distinct global minimizers of $f$, and let
$t \in (0, 1)$. Then $x_t \in K$ and

    $$f(x_0) \le f(x_t) < (1 - t) f(x_0) + t f(x_1) = f(x_0),$$

which is a contradiction. Hence, $f$ has at most one minimizer in $K$.

(d) Suppose that $c \in K \setminus \mathrm{ext}(K)$ has $f(c) > \sup_{\mathrm{ext}(K)} f$. By Theorem 4.15,
there exists a probability measure $p$ on $\mathrm{ext}(K)$ such that, for all affine
functions $\ell$ on $K$,

    $$\ell(c) = \int_{\mathrm{ext}(K)} \ell(x) \, \mathrm{d}p(x),$$

i.e. $c = \mathbb{E}_{X \sim p}[X]$. Then Jensen's inequality implies that

    $$\mathbb{E}_{X \sim p}[f(X)] \ge f(c) > \sup_{\mathrm{ext}(K)} f,$$

which is a contradiction, since $p$ is supported on $\mathrm{ext}(K)$ and so the
left-hand side is at most $\sup_{\mathrm{ext}(K)} f$. Hence, since $\sup_K f \ge \sup_{\mathrm{ext}(K)} f$, $f$ must have
the same maximum value on $\mathrm{ext}(K)$ as it does on $K$. □


Remark 4.20. Note well that Theorem 4.19 does not assert the existence of
minimizers, which requires non-emptiness and compactness of $K$, and lower
semicontinuity of $f$. For example:

• the exponential function on $\mathbb{R}$ is strictly convex, continuous and bounded
below by 0 yet has no minimizer;

• the interval $[-1, 1]$ is compact, and the function $f \colon [-1, 1] \to \mathbb{R} \cup \{\pm\infty\}$
defined by

    $$f(x) := \begin{cases} x, & \text{if } |x| < \tfrac{1}{2}, \\ +\infty, & \text{if } |x| \ge \tfrac{1}{2}, \end{cases}$$

is convex, yet $f$ has no minimizer: although $\inf_{x \in [-1, 1]} f(x) = -\tfrac{1}{2}$,
there is no $x$ for which $f(x)$ attains this infimal value.


Definition 4.21. A convex optimization problem (or convex program ) is a 
minimization problem in which the objective function and all constraints are 
equalities or inequalities with respect to convex functions. 


Remark 4.22. (a) Beware of the common pitfall of saying that a convex
program is simply the minimization of a convex function over a convex
set. Of course, by Theorem 4.19, such minimization problems are nicer
than general minimization problems, but bona fide convex programs are
an even nicer special case.

(b) In practice, many problems are not obviously convex programs, but can 
be transformed into convex programs by, e.g., a cunning change of vari- 
ables. Being able to spot the right equivalent problem is a major part of 
the art of optimization. 

It is difficult to overstate the importance of convexity in making optimiza- 
tion problems tractable. Indeed, it has been remarked that lack of convexity 
is a much greater obstacle to tractability than high dimension. There are 
many powerful methods for the solution of convex programs, with corre- 
sponding standard software libraries such as cvxopt. For example, interior 
point methods explore the interior of the feasible set in search of the solution 
to the convex program, while being kept away from the boundary of the fea- 
sible set by a barrier function. The discussion that follows is only intended 
as an outline; for details, see Boyd and Vandenberghe (2004, Chapter 11). 

Consider the convex program

    minimize: $f(x)$
    with respect to: $x \in \mathbb{R}^n$
    subject to: $c_i(x) \le 0$ for $i = 1, \dots, m$,

where the functions $f, c_1, \dots, c_m \colon \mathbb{R}^n \to \mathbb{R}$ are all convex and differentiable.
Let $F$ denote the feasible set for this program. Let $0 < \mu \ll 1$ be a small
scalar, called the barrier parameter, and define the barrier function associated
to the program by

    $$B(x; \mu) := f(x) - \mu \sum_{i=1}^{m} \log(-c_i(x)).$$

Note that $B(\,\cdot\,; \mu)$ is strictly convex for $\mu > 0$, that $B(x; \mu) \to +\infty$ as $x \to \partial F$,
and that $B(\,\cdot\,; 0) = f$; therefore, the unique minimizer $x^*_\mu$ of $B(\,\cdot\,; \mu)$ lies in
the interior $\mathring{F}$ and (hopefully) converges to the minimizer of the original
problem as $\mu \to 0$. Indeed, using arguments based on convex duality, one can
show that

    $$f(x^*_\mu) - \inf_{x \in F} f(x) \le m \mu.$$

The strictly convex problem of minimizing $B(\,\cdot\,; \mu)$ can be solved
approximately using Newton's method. In fact, however, one settles for a partial
minimization of $B(\,\cdot\,; \mu)$ using only one or two steps of Newton's method,
then decreases $\mu$ to $\mu'$, performs another partial minimization of $B(\,\cdot\,; \mu')$
using Newton's method, and so on in this alternating fashion.
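
The alternation between partial Newton minimization and reduction of the barrier parameter can be sketched on a one-dimensional toy problem. Everything specific below is an assumption made for illustration: the program (minimize x subject to 1 - x <= 0 and x - 4 <= 0, with solution x = 1), the logarithmic barrier written as -mu * (log(x - 1) + log(4 - x)), the halving of mu, and the step-halving safeguard that keeps iterates strictly feasible.

```python
def barrier_method(x, mu=1.0, shrink=0.5, mu_min=1e-6, newton_steps=2):
    """Interior point sketch for: minimize x subject to 1 <= x <= 4."""
    while mu > mu_min:
        for _ in range(newton_steps):
            # First and second derivatives of
            # B(x; mu) = x - mu * (log(x - 1) + log(4 - x)).
            grad = 1.0 - mu / (x - 1.0) + mu / (4.0 - x)
            hess = mu / (x - 1.0) ** 2 + mu / (4.0 - x) ** 2
            step = -grad / hess
            # Halve the Newton step until the new point is strictly feasible.
            t = 1.0
            while not (1.0 < x + t * step < 4.0):
                t *= 0.5
            x = x + t * step
        mu *= shrink   # decrease the barrier parameter and repeat
    return x

x_star = barrier_method(2.5)   # start in the interior of the feasible set
```

The iterates remain in the open interval (1, 4) throughout and approach the boundary point x = 1 only as the barrier parameter tends to zero.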


4.5 Linear Programming

Theorem 4.19 has the following immediate corollary for the minimization and
maximization of affine functions on convex sets:

Corollary 4.23. Let $\ell \colon K \to \mathbb{R}$ be a continuous affine function on a
non-empty, compact, convex set $K \subseteq \mathcal{X}$. Then

    $$\mathrm{ext}\{\ell(x) \mid x \in K\} = \mathrm{ext}\{\ell(x) \mid x \in \mathrm{ext}(K)\}.$$

That is, $\ell$ has the same minimum and maximum values over both $K$ and the
set of extreme points of $K$.

Definition 4.24. A linear program is an optimization problem of the form

    extremize: $f(x)$
    with respect to: $x \in \mathbb{R}^p$
    subject to: $g_i(x) \le 0$ for $i = 1, \dots, q$,

where the functions $f, g_1, \dots, g_q \colon \mathbb{R}^p \to \mathbb{R}$ are all affine functions. Linear
programs are often written in the canonical form

    maximize: $c \cdot x$
    with respect to: $x \in \mathbb{R}^n$
    subject to: $Ax \le b$
                $x \ge 0$,

where $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ are given, and the two inequalities are
interpreted componentwise. (Conversion to canonical form, and in particular
the introduction of the non-negativity constraint $x \ge 0$, is accomplished
by augmenting the original $x \in \mathbb{R}^p$ with additional variables called slack
variables to form the extended variable $x \in \mathbb{R}^n$.)

Note that the feasible set for a linear program is an intersection of finitely
many half-spaces of $\mathbb{R}^n$, i.e. a polytope. This polytope may be empty, in which
case the constraints are mutually contradictory and the program is said to
be infeasible. Also, the polytope may be unbounded in the direction of $c$, in
which case the extreme value of the problem is infinite.

Since linear programs are special cases of convex programs, methods such
as interior point methods are applicable to linear programs as well. Such
methods approach the optimum point $x^*$, which is necessarily an extremal
element of the feasible polytope, from the interior of the feasible
polytope. Historically, however, such methods were preceded by methods such
as Dantzig's simplex algorithm, which sets out to directly explore the set of
extreme points in a (hopefully) efficient way. Although the theoretical
worst-case complexity of the simplex method as formulated by Dantzig is exponential
in $n$ and $m$, in practice the simplex method is remarkably efficient (typically
having polynomial running time) provided that certain precautions are taken
to avoid pathologies such as ‘stalling’.
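
The fact that a linear objective on a bounded feasible polytope attains its extreme values at extreme points can be checked by brute force in two dimensions. The sketch below is neither the simplex method nor an interior point method: it simply enumerates the vertices by intersecting pairs of constraint boundaries, which is only sensible for tiny problems. The function name vertices_2d and the example data are assumptions.

```python
from itertools import combinations

def vertices_2d(A, b, tol=1e-9):
    """Vertices of {x in R^2 : A x <= b}, found by intersecting boundary lines."""
    verts = []
    for (a1, b1), (a2, b2) in combinations(list(zip(A, b)), 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < tol:
            continue  # parallel boundary lines never meet in a vertex
        # Cramer's rule for the 2x2 system a1.x = b1, a2.x = b2.
        x = (b1 * a2[1] - a1[1] * b2) / det
        y = (a1[0] * b2 - b1 * a2[0]) / det
        # Keep the intersection point only if it satisfies every constraint.
        if all(ai[0] * x + ai[1] * y <= bi + tol for ai, bi in zip(A, b)):
            verts.append((x, y))
    return verts

# maximize 3x + 2y subject to x + 2y <= 8, 3x + y <= 9, x >= 0, y >= 0;
# the non-negativity constraints are written as -x <= 0 and -y <= 0.
A = [(1.0, 2.0), (3.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
b = [8.0, 9.0, 0.0, 0.0]
c = (3.0, 2.0)

verts = vertices_2d(A, b)
best = max(verts, key=lambda v: c[0] * v[0] + c[1] * v[1])
best_value = c[0] * best[0] + c[1] * best[1]
```

The number of candidate intersections grows combinatorially with the number of constraints, which is why practical solvers explore the vertices selectively or approach the optimum from the interior.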


4.6 Least Squares 

An elementary example of convex programming is unconstrained quadratic 
minimization, otherwise known as least squares. Least squares minimization 
plays a central role in elementary statistical estimation, as will be demon- 
strated by the Gauss-Markov theorem (Theorem 6.2). The next three results 
show that least squares problems have unique solutions, which are given in 
terms of an orthogonality criterion, which in turn reduces to a system of 
linear equations, the normal equations. 

Lemma 4.25. Let $K$ be a non-empty, closed, convex subset of a Hilbert space
$\mathcal{H}$. Then, for each $y \in \mathcal{H}$, there is a unique element $\hat{x} = \Pi_K y \in K$ such that

    $$\hat{x} \in \arg\min_{x \in K} \|y - x\|.$$

Proof. By Exercise 4.1, the function $J \colon \mathcal{H} \to [0, \infty)$ defined by $J(x) :=
\|y - x\|^2$ is strictly convex, and hence it has at most one minimizer in $K$.
Therefore, it only remains to show that $J$ has at least one minimizer in
$K$. Since $J$ is bounded below (on $\mathcal{H}$, not just on $K$), $J$ has a sequence of
approximate minimizers $(x_n)_{n \in \mathbb{N}}$ in $K$: let

    $$I := \inf_{x \in K} \|y - x\|, \qquad I^2 \le \|y - x_n\|^2 \le I^2 + \frac{1}{n}.$$

By the parallelogram identity for the Hilbert norm $\|\cdot\|$,

    $$\|(y - x_m) + (y - x_n)\|^2 + \|(y - x_m) - (y - x_n)\|^2 = 2\|y - x_m\|^2 + 2\|y - x_n\|^2,$$

and hence

    $$\|2y - (x_m + x_n)\|^2 + \|x_n - x_m\|^2 \le 4I^2 + \frac{2}{n} + \frac{2}{m}.$$

Since $K$ is convex, $\tfrac{1}{2}(x_m + x_n) \in K$, so the first term on the left-hand side
above is bounded below as follows:

    $$\|2y - (x_m + x_n)\|^2 = 4 \left\| y - \frac{x_m + x_n}{2} \right\|^2 \ge 4I^2.$$

Hence,

    $$\|x_n - x_m\|^2 \le 4I^2 + \frac{2}{n} + \frac{2}{m} - 4I^2 = \frac{2}{n} + \frac{2}{m},$$

and so the sequence $(x_n)_{n \in \mathbb{N}}$ is Cauchy; since $\mathcal{H}$ is complete and $K$ is closed,
this sequence converges to some $\hat{x} \in K$. Since the norm $\|\cdot\|$ is continuous,
$\|y - \hat{x}\| = I$. □

Lemma 4.26 (Orthogonality of the residual). Let $V$ be a closed subspace of
a Hilbert space $\mathcal{H}$ and let $b \in \mathcal{H}$. Then $\hat{x} \in V$ minimizes the distance to $b$ if
and only if the residual $\hat{x} - b$ is orthogonal to $V$, i.e.

    $$\hat{x} = \arg\min_{x \in V} \|x - b\| \iff (\hat{x} - b) \perp V.$$

Proof. Let $J(x) := \tfrac{1}{2}\|x - b\|^2$, which has the same minimizers as $x \mapsto
\|x - b\|$; by Lemma 4.25, such a minimizer exists and is unique. Suppose that
$(x - b) \perp V$ and let $y \in V$. Then $y - x \in V$ and so $(y - x) \perp (x - b)$. Hence,
by Pythagoras' theorem,

    $$\|y - b\|^2 = \|y - x\|^2 + \|x - b\|^2 \ge \|x - b\|^2,$$

and so $x$ minimizes $J$.

Conversely, suppose that $x$ minimizes $J$. Then, for every $y \in V$,

    $$0 = \frac{\partial}{\partial \lambda} J(x + \lambda y) \bigg|_{\lambda = 0} = \frac{1}{2} \bigl( \langle y, x - b \rangle + \langle x - b, y \rangle \bigr) = \operatorname{Re} \langle x - b, y \rangle$$

and, in the complex case,

    $$0 = \frac{\partial}{\partial \lambda} J(x + \lambda i y) \bigg|_{\lambda = 0} = \frac{1}{2} \bigl( -i \langle y, x - b \rangle + i \langle x - b, y \rangle \bigr) = -\operatorname{Im} \langle x - b, y \rangle.$$

Hence, $\langle x - b, y \rangle = 0$, and since $y$ was arbitrary, $(x - b) \perp V$. □

Lemma 4.27 (Normal equations). Let $A \colon \mathcal{H} \to \mathcal{K}$ be a linear operator
between Hilbert spaces such that $\operatorname{ran} A \subseteq \mathcal{K}$ is closed. Then, given $b \in \mathcal{K}$,

    $$\hat{x} \in \arg\min_{x \in \mathcal{H}} \|Ax - b\|_{\mathcal{K}} \iff A^* A \hat{x} = A^* b, \tag{4.5}$$

the equations on the right-hand side being known as the normal equations.
If, in addition, $A$ is injective, then $A^* A$ is invertible and the least squares
problem / normal equations have a unique solution.

Proof. As a consequence of completeness, the only element of a Hilbert space
that is orthogonal to every other element of the space is the zero element.
Hence,

    $\|Ax - b\|_{\mathcal{K}}$ is minimal
    $\iff (Ax - b) \perp Av$ for all $v \in \mathcal{H}$            (by Lemma 4.26)
    $\iff \langle Ax - b, Av \rangle_{\mathcal{K}} = 0$ for all $v \in \mathcal{H}$
    $\iff \langle A^* A x - A^* b, v \rangle_{\mathcal{H}} = 0$ for all $v \in \mathcal{H}$
    $\iff A^* A x = A^* b$                                  (by completeness of $\mathcal{H}$),

and this shows the equivalence (4.5).

By Proposition 3.16(d), $\ker A^* = (\operatorname{ran} A)^\perp$. Therefore, the restriction of $A^*$
to the range of $A$ is injective. Hence, if $A$ itself is injective, then it follows that
$A^* A$ is injective. Again by Proposition 3.16(d), $(\operatorname{ran} A^*)^\perp = \ker A = \{0\}$, and
since $\mathcal{H}$ is complete, this implies that $A^*$ is surjective. Since $A$ is surjective
onto its range, it follows that $A^* A$ is surjective, and hence bijective and
invertible. □
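
In finite dimensions the normal equations are an ordinary linear system. The sketch below fits a straight line to four data points by forming and solving the 2 x 2 system, and then checks the orthogonality criterion of Lemma 4.26 numerically. The data, the helper solve_2x2 and the variable names are assumptions made for this example; for larger or ill-conditioned problems one would normally prefer a QR or SVD based solver to forming the product of A with its transpose explicitly.

```python
def solve_2x2(M, v):
    """Solve the 2x2 linear system M x = v by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    x0 = (v[0] * M[1][1] - M[0][1] * v[1]) / det
    x1 = (M[0][0] * v[1] - M[1][0] * v[0]) / det
    return [x0, x1]

# Fit b_i ~ x[0] + x[1] * t_i, i.e. A x ~ b with rows A_i = (1, t_i).
t = [0.0, 1.0, 2.0, 3.0]
b = [1.1, 2.9, 5.2, 6.8]
A = [[1.0, ti] for ti in t]

# Normal equations: (A^T A) x = A^T b.
AtA = [[sum(row[i] * row[j] for row in A) for j in range(2)] for i in range(2)]
Atb = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(2)]
x_hat = solve_2x2(AtA, Atb)

# The residual A x_hat - b should be orthogonal to both columns of A.
residual = [row[0] * x_hat[0] + row[1] * x_hat[1] - bi for row, bi in zip(A, b)]
orth = [sum(row[i] * ri for row, ri in zip(A, residual)) for i in range(2)]
```

Here the columns of A are linearly independent, so A is injective and the solution (intercept 1.09, slope 1.94) is unique.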

Weighting and Regularization. It is common in practice that one does
not want to minimize the $\mathcal{K}$-norm directly, but perhaps some re-weighted
version of the $\mathcal{K}$-norm. This re-weighting is accomplished by a self-adjoint
and positive definite² operator $Q \colon \mathcal{K} \to \mathcal{K}$: we define a new inner product
and norm on $\mathcal{K}$ by

    $$\langle k, k' \rangle_Q := \langle k, Q k' \rangle_{\mathcal{K}}, \qquad \|k\|_Q := \langle k, k \rangle_Q^{1/2}.$$

It is a standard fact that the self-adjoint operator $Q$ possesses an operator
square root, i.e. a self-adjoint $Q^{1/2} \colon \mathcal{K} \to \mathcal{K}$ such that $Q^{1/2} Q^{1/2} = Q$; for
reasons of symmetry, it is common to express the inner product and norm
induced by $Q$ using this square root:

    $$\langle k, k' \rangle_Q = \langle Q^{1/2} k, Q^{1/2} k' \rangle_{\mathcal{K}}, \qquad \|k\|_Q = \|Q^{1/2} k\|_{\mathcal{K}}.$$

We then consider the problem, given $b \in \mathcal{K}$, of finding $x \in \mathcal{H}$ to minimize

    $$\frac{1}{2} \|Ax - b\|_Q^2 = \frac{1}{2} \|Q^{1/2}(Ax - b)\|_{\mathcal{K}}^2.$$

Another situation that arises frequently in practice is that the normal
equations do not have a unique solution (e.g. because $A^* A$ is not invertible)
and so it is necessary to select one by some means, or that one has some
prior belief that ‘the right solution’ should be close to some initial guess $x_0$.
A technique that accomplishes both of these aims is Tikhonov regularization
(known in the statistics literature as ridge regression). In this situation, we
minimize the following sum of two quadratic functionals:

    $$\frac{1}{2} \|Ax - b\|_{\mathcal{K}}^2 + \frac{1}{2} \|x - x_0\|_R^2,$$

where $R \colon \mathcal{H} \to \mathcal{H}$ is self-adjoint and positive definite, and $x_0 \in \mathcal{H}$.


² If $Q$ is not positive definite, but merely positive semi-definite and self-adjoint, then
existence of solutions to the associated least squares problems still holds, but uniqueness
can fail.

These two modifications to ordinary least squares, weighting and
regularization, can be combined. The normal equations for weighted and regularized
least squares are easily derived from Lemma 4.27:

Theorem 4.28 (Normal equations for weighted and Tikhonov-regularized
least squares). Let $\mathcal{H}$ and $\mathcal{K}$ be Hilbert spaces, let $A \colon \mathcal{H} \to \mathcal{K}$ have closed
range, let $Q$ and $R$ be self-adjoint and positive definite on $\mathcal{K}$ and $\mathcal{H}$
respectively, and let $b \in \mathcal{K}$, $x_0 \in \mathcal{H}$. Let

    $$J(x) := \frac{1}{2} \|Ax - b\|_Q^2 + \frac{1}{2} \|x - x_0\|_R^2.$$

Then

    $$\hat{x} \in \arg\min_{x \in \mathcal{H}} J(x) \iff (A^* Q A + R) \hat{x} = A^* Q b + R x_0.$$

Proof. Exercise 4.4. □
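
A small numerical sketch shows how regularization selects one solution when the unregularized normal equations are singular. The single equation x_1 + x_2 = 2 in two unknowns has infinitely many solutions; with the illustrative (assumed) choices Q = 1, R = rho times the identity and initial guess zero, the regularized normal equations have a unique solution close to the minimum-norm solution (1, 1). The helper solve_2x2 and all numerical values are assumptions.

```python
def solve_2x2(M, v):
    """Solve the 2x2 linear system M x = v by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    x0 = (v[0] * M[1][1] - M[0][1] * v[1]) / det
    x1 = (M[0][0] * v[1] - M[1][0] * v[0]) / det
    return [x0, x1]

rho = 0.01               # Tikhonov regularization parameter (R = rho * I)
A = [[1.0, 1.0]]         # one equation, two unknowns: A^T A is singular
b = [2.0]
q = [1.0]                # diagonal of the weighting operator Q
x_guess = [0.0, 0.0]     # the initial guess x_0

# Left- and right-hand sides of (A^T Q A + R) x = A^T Q b + R x_0.
M = [[sum(A[k][i] * q[k] * A[k][j] for k in range(len(A)))
      + (rho if i == j else 0.0) for j in range(2)] for i in range(2)]
v = [sum(A[k][i] * q[k] * b[k] for k in range(len(A))) + rho * x_guess[i]
     for i in range(2)]
x_hat = solve_2x2(M, v)  # both components equal 2 / (2 + rho)
```

As rho decreases to zero the solution tends to (1, 1), while a larger rho pulls it more strongly towards the initial guess.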

It is also interesting to consider regularizations that do not come from a 
Hilbert norm, but instead from some other function. As will be elaborated 
upon in Chapter 6, there is a strong connection between regularized opti- 
mization problems and inverse problems, and the choice of regularization in 
some sense describes the practitioner’s ‘prior beliefs’ about the structure of 
the solution. 

Nonlinear Least Squares and Gauss-Newton Iteration. It often occurs
in practice that one wishes to find a vector of parameters $\theta \in \mathbb{R}^p$ such that a
function $\mathbb{R}^k \ni x \mapsto f(x; \theta) \in \mathbb{R}$ best fits a collection of data points
$\{(x_i, y_i) \in \mathbb{R}^k \times \mathbb{R} \mid i = 1, \dots, m\}$. For each candidate parameter vector $\theta$,
define the residual vector

    $$r(\theta) := \begin{bmatrix} r_1(\theta) \\ \vdots \\ r_m(\theta) \end{bmatrix} = \begin{bmatrix} y_1 - f(x_1; \theta) \\ \vdots \\ y_m - f(x_m; \theta) \end{bmatrix} \in \mathbb{R}^m.$$

The aim is to find $\theta$ to minimize the objective function $J(\theta) := \|r(\theta)\|_2^2$. Let

    $$A := \begin{bmatrix} \dfrac{\partial r_1(\theta)}{\partial \theta_1} & \cdots & \dfrac{\partial r_1(\theta)}{\partial \theta_p} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial r_m(\theta)}{\partial \theta_1} & \cdots & \dfrac{\partial r_m(\theta)}{\partial \theta_p} \end{bmatrix} \Bigg|_{\theta = \theta_n} \in \mathbb{R}^{m \times p}$$

be the Jacobian matrix of the residual vector, and note that $A = -\mathrm{D}F(\theta_n)$,
where

    $$F(\theta) := \begin{bmatrix} f(x_1; \theta) \\ \vdots \\ f(x_m; \theta) \end{bmatrix} \in \mathbb{R}^m.$$

Consider the first-order Taylor approximation

    $$r(\theta) \approx r(\theta_n) + A(\theta - \theta_n).$$

Thus, to approximately minimize $\|r(\theta)\|_2$, we find $\delta := \theta - \theta_n$ that
makes the right-hand side of the approximation as close to zero as possible
in the least squares sense. This is an ordinary linear least squares problem,
the solution of which is given by the normal equations as

    $$\delta = -(A^* A)^{-1} A^* r(\theta_n).$$

Thus, we obtain the Gauss-Newton iteration for a sequence of
approximate minimizers of $J$:

    $$\theta_{n+1} := \theta_n - (A^* A)^{-1} A^* r(\theta_n)
                 = \theta_n + \bigl( (\mathrm{D}F(\theta_n))^* (\mathrm{D}F(\theta_n)) \bigr)^{-1} (\mathrm{D}F(\theta_n))^* r(\theta_n).$$

In general, the Gauss-Newton iteration is not guaranteed to converge to
the exact solution, particularly if $\delta$ is ‘too large’, in which case it may be
appropriate to use a judiciously chosen small positive multiple of $\delta$. The
use of Tikhonov regularization in this context is known as the
Levenberg-Marquardt algorithm or trust region method, and the small multiplier applied
to $\delta$ is essentially the reciprocal of the Tikhonov regularization parameter.
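
A one-parameter sketch of the Gauss-Newton iteration follows; with a single parameter the matrix inverse in the update reduces to a scalar division. The model f(t; theta) = exp(-theta * t), the synthetic noise-free data generated with theta = 1, the starting value 0 and the fixed iteration count are all assumptions made for this illustration.

```python
import math

t = [1.0, 2.0]
y = [math.exp(-ti) for ti in t]   # synthetic data generated with theta = 1

def gauss_newton(theta, n_iter=20):
    """Gauss-Newton for fitting y_i ~ exp(-theta * t_i), scalar theta."""
    for _ in range(n_iter):
        F = [math.exp(-theta * ti) for ti in t]    # model values F(theta)
        DF = [-ti * Fi for ti, Fi in zip(t, F)]    # derivative of F in theta
        r = [yi - Fi for yi, Fi in zip(y, F)]      # residual vector r(theta)
        A = [-d for d in DF]                       # Jacobian of r: A = -DF
        AtA = sum(a * a for a in A)
        Atr = sum(a * ri for a, ri in zip(A, r))
        theta = theta - Atr / AtA                  # theta - (A*A)^{-1} A* r
    return theta

theta_hat = gauss_newton(0.0)
```

Because the data are fitted exactly at theta = 1 (a zero-residual problem), the iteration converges rapidly here; with noisy data or a poor starting point a damped step, as discussed above, may be needed.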

[...] particularly for the UQ methods in Chapter 14, variations upon the genetic
evolution approach, e.g. the differential evolution algorithm (Price et al.,
2005; Storn and Price, 1997), have proved up to the task of producing robust
results, if not always quick ones. There is no ‘one size fits all’ approach to
constrained global optimization: it is basically impossible to be quick, robust,
and general all at the same time.

In practice, it is very useful to work using an optimization framework that
provides easy interfaces to many optimization methods, with easy interchange
among strategies for population generation, enforcement of constraints,
termination criteria, and so on: see, for example, the DAKOTA (Adams et al.,
2014) and Mystic (McKerns et al., 2009, 2011) projects.


4.8 Exercises 

Exercise 4.1. Let || • || be a norm on a vector space V, and fix x E V. Show 
that the function J: V -T [0, oo) defined by J(x) := \\x — x\\ is convex, and 
that J(x) := \\\x — x \\ 2 is strictly convex if the norm is induced by an inner 
product. Give an example of a norm for which J{x) := \\\x — x\\ 2 is not 
strictly convex. 

Exercise 4.2. Let K be a non-empty, closed, convex subset of a Hilbert space 
77. Lemma 4.25 shows that there is a well-defined function 77 k : W, K that 
assigns to each y E 77 the unique 77 kV £ 77 that is closest to y with respect 
to the norm on 77. 

(a) Prove the variational inequality that x = Ilxy if and only if x E K and 

(x, z — x) > (y,z — x) for all z E K. 

(b) Prove that 77 k is non-expansive, i.e. 

\\n K yi - n K y 2 \\ < llyi - y 2 \\ for all y 1 ,y 2 G U, 
and hence a continuous function. 

Exercise 4.3. Let $A \colon \mathcal{H} \to \mathcal{K}$ be a linear operator between Hilbert spaces
such that $\operatorname{ran} A$ is a closed subspace of $\mathcal{K}$, let $Q \colon \mathcal{K} \to \mathcal{K}$ be self-adjoint and
positive-definite, and let $b \in \mathcal{K}$. Let

\[ J(x) := \tfrac{1}{2} \| A x - b \|_Q^2 . \]

Calculate the gradient and Hessian (second derivative) of $J$. Hence show
that, regardless of the initial condition $x_0 \in \mathcal{H}$, Newton’s method finds the
minimum of $J$ in one step.

Exercise 4.4. Prove Theorem 4.28. Hint: Consider the operator from $\mathcal{H}$ into
$\mathcal{K} \oplus \mathcal{L}$ given by

\[ x \mapsto \begin{bmatrix} Q^{1/2} A x \\ R^{1/2} x \end{bmatrix} . \]


Chapter 5

Measures of Information 
and Uncertainty 


As we know, there are known knowns. There 
are things we know we know. We also know 
there are known unknowns. That is to say we 
know there are some things we do not know. 
But there are also unknown unknowns, the 
ones we don’t know we don’t know. 


Donald Rumsfeld 


This chapter briefly summarizes some basic numerical measures of uncertainty,
from interval bounds to information-theoretic quantities such as
(Shannon) information and entropy. This discussion then naturally leads to 
consideration of distances (and distance-like functions) between probability 
measures. 


5.1 The Existence of Uncertainty 

At a very fundamental level, the first step in understanding the uncertainties
affecting some system is to identify the sources of uncertainty. Sometimes, 
this can be a challenging task because there may be so much lack of knowledge 
about, e.g. the relevant physical mechanisms, that one does not even know 
what a list of the important parameters would be, let alone what uncertainty 
one has about their values. The presence of such so-called unknown unknowns 
is of major concern in high-impact settings like risk assessment. 

One way of assessing the presence of unknown unknowns is that if one
subscribes to a deterministic view of the universe in which reality maps inputs
$x \in \mathcal{X}$ to outputs $y = f(x) \in \mathcal{Y}$ by a well-defined single-valued function
$f \colon \mathcal{X} \to \mathcal{Y}$, then unknown unknowns are additional variables $z \in \mathcal{Z}$ whose
existence one infers from contradictory observations like

\[ f(x) = y_1 \quad \text{and} \quad f(x) = y_2 \neq y_1 . \]

Unknown unknowns explain away this contradiction by asserting the existence
of a space $\mathcal{Z}$ containing distinct elements $z_1$ and $z_2$, that in fact $f$ is a
function $f \colon \mathcal{X} \times \mathcal{Z} \to \mathcal{Y}$, and that the observations were actually

\[ f(x, z_1) = y_1 \quad \text{and} \quad f(x, z_2) = y_2 . \]

Of course, this viewpoint does nothing to actually identify the relevant space
$\mathcal{Z}$ nor the values $z_1$ and $z_2$.

A related issue is that of model form uncertainty, i.e. an epistemic lack
of knowledge about which of a number of competing models for some sys- 
tem of interest is ‘correct’. Usually, the choice to be made is a qualitative 
one. For example, should one model some observed data using a linear or 
a non-linear statistical regression model? Or, should one model a fluid flow 
through a pipe using a high-fidelity computational fluid dynamics model in 
three spatial dimensions, or using a coarse model that treats the pipe as 
one-dimensional? This apparently qualitative choice can be rendered into a 
quantitative form by placing a Bayesian prior on the discrete index set of 
the models, conditioning upon observed data, and examining the resulting 
posterior. However, it is important to not misinterpret the resulting posterior 
probabilities of the models: we do not claim that the more probable model 
is ‘correct’, only that it has relatively better explanatory power compared to 
the other models in the model class. 
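
As a minimal sketch of this procedure, consider two competing models for a sequence of binary trials. The data (7 successes in 10 trials), the two candidate success probabilities and the uniform prior on the model index are all hypothetical values chosen purely for illustration.

```python
from math import comb


def binomial_likelihood(k, n, p):
    """Probability of k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def model_posterior(prior, likelihoods):
    """Posterior probabilities on a discrete index set of models, via Bayes' rule."""
    joint = [pr * lik for pr, lik in zip(prior, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical data: 7 successes in 10 trials.
k, n = 7, 10
# Hypothetical model class: success probability 0.5 (model 0) or 0.7 (model 1).
likelihoods = [binomial_likelihood(k, n, 0.5), binomial_likelihood(k, n, 0.7)]
posterior = model_posterior([0.5, 0.5], likelihoods)
print(posterior)
```

The larger posterior weight of model 1 says only that it explains these data better than model 0 does; it says nothing about models outside the chosen class.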


5.2 Interval Estimates 

Sometimes, nothing more can be said about some unknown quantity than 
a range of possible values, with none more or less probable than any other. 
In the case of an unknown real number x, such information may boil down 
to an interval such as $[a, b]$ in which $x$ is known to lie. This is, of course, a
very basic form of uncertainty, and one may simply summarize the degree of 
uncertainty by the length of the interval. 

Interval Arithmetic. As well as summarizing the degree of uncertainty
by the length of the interval estimate, it is often of interest to manipulate
the interval estimates themselves as if they were numbers. One method of
manipulating interval estimates of real quantities is interval arithmetic. Each
of the basic arithmetic operations $\ast \in \{ +, -, \cdot, / \}$ is extended to intervals
$A, B \subseteq \mathbb{R}$ by

\[ A \ast B := \{ x \in \mathbb{R} \mid x = a \ast b \text{ for some } a \in A, b \in B \} . \]
Hence,

\begin{align*}
[a, b] + [c, d] &= [a + c, b + d], \\
[a, b] - [c, d] &= [a - d, b - c], \\
[a, b] \cdot [c, d] &= \bigl[ \min \{ a \cdot c, a \cdot d, b \cdot c, b \cdot d \}, \max \{ a \cdot c, a \cdot d, b \cdot c, b \cdot d \} \bigr], \\
[a, b] / [c, d] &= \bigl[ \min \{ a/c, a/d, b/c, b/d \}, \max \{ a/c, a/d, b/c, b/d \} \bigr],
\end{align*}

where the expression for $[a, b] / [c, d]$ is defined only when $0 \notin [c, d]$. The
addition and multiplication operations are commutative, associative and
sub-distributive:

\[ A (B + C) \subseteq A B + A C . \]

These ideas can be extended to elementary functions without too much
difficulty: monotone functions are straightforward, and the Intermediate Value
Theorem ensures that the continuous image of an interval is again an interval.
However, for general functions $f$, it is not straightforward to compute (the
convex hull of) the image of $f$.
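
The four rules above translate directly into code. The following is a minimal sketch; the class name `Interval` and its method names are choices made here for illustration and do not refer to any particular library. It ignores floating-point rounding, which a rigorous implementation would control by rounding lower endpoints down and upper endpoints up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] with the basic interval-arithmetic operations."""
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __truediv__(self, other):
        if other.lo <= 0 <= other.hi:
            raise ZeroDivisionError("divisor interval contains 0")
        quotients = [self.lo / other.lo, self.lo / other.hi,
                     self.hi / other.lo, self.hi / other.hi]
        return Interval(min(quotients), max(quotients))

    def contains(self, other):
        """True if `other` is a subset of this interval."""
        return self.lo <= other.lo and other.hi <= self.hi


A, B, C = Interval(-1, 1), Interval(1, 2), Interval(-2, -1)
lhs = A * (B + C)      # B + C = [-1, 1], so A(B + C) = [-1, 1]
rhs = A * B + A * C    # [-2, 2] + [-2, 2] = [-4, 4]
print(lhs, rhs, rhs.contains(lhs))
print(A - A)           # [-2, 2], not [0, 0]
```

The example shows that sub-distributivity can be strict, and that `A - A` is not the degenerate interval at zero: interval arithmetic does not know that the two operands are the same quantity.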

Interval analysis corresponds to a worst-case propagation of uncertainty:
the interval estimate on the output $f$ is the greatest lower bound and least
upper bound compatible with the interval estimates on the input of $f$.
However, in practical settings, one shortcoming of interval analysis is that it
can yield interval bounds on output quantities of interest that are too pes- 
simistic (i.e. too wide) to be useful: there is no scope in the interval arithmetic 
paradigm to consider how likely or unlikely it would be for the various inputs 
to ‘conspire’ in a highly correlated way to produce the most extreme output 
values. (The heuristic idea that a function of many independent or weakly 
correlated random variables is unlikely to stray far from its mean or median 
value is known as the concentration of measure phenomenon, and will be dis- 
cussed in Chapter 10.) In order to produce more refined interval estimates, 
one will need further information, usually probabilistic in nature, on possible 
correlations among inputs. 
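
The gap between the worst-case interval bound and typical behaviour can be seen in a small simulation. The sketch below assumes, purely for illustration, that the inputs are independent and uniformly distributed on $[-1, 1]$; interval arithmetic itself makes no such assumption.

```python
import random

random.seed(0)
n = 100            # number of inputs, each known only to lie in [-1, 1]
samples = 10_000   # number of Monte Carlo draws

# Worst-case interval bound on the sum x_1 + ... + x_n.
interval_bound = (-float(n), float(n))

# Illustrative assumption: inputs are independent and uniform on [-1, 1].
sums = [sum(random.uniform(-1.0, 1.0) for _ in range(n)) for _ in range(samples)]
largest = max(abs(s) for s in sums)

print("interval bound:", interval_bound)
print("largest |sum| observed:", largest)
```

Under this independence assumption the observed sums occupy only a small fraction of the interval $[-n, n]$, which is the concentration effect mentioned above.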

‘Intervals’ of Probability Measures. The distributional robustness
approaches covered in Chapter 14, as well as other theories of imprecise
probability such as Dempster-Shafer theory, can be seen as an extension
of the interval arithmetic approach from partially known real numbers to
partially known probability measures. As hybrid interval-probabilistic
approaches, they are one way to resolve the ‘overly pessimistic’ shortcomings of
classical interval arithmetic as discussed in the previous paragraph. These
ideas will be revisited in Chapter 14.


5.3 Variance, Information and Entropy

Suppose that one adopts a subjectivist (e.g. Bayesian) interpretation of
probability, so that one’s knowledge about some system of interest with
possible values in $\mathcal{X}$ is summarized by a probability measure $\mu \in \mathcal{M}_1(\mathcal{X})$. The
probability measure $\mu$ is a very rich and high-dimensional object; often it is
necessary to summarize the degree of uncertainty implicit in $\mu$ with a few
numbers, perhaps even just one number.

Variance. One obvious summary statistic, when $\mathcal{X}$ is (a subset of) a normed
vector space and $\mu$ has mean $m$, is the variance of $\mu$, i.e.

\[ \mathbb{V}(\mu) := \mathbb{E}_{X \sim \mu} \bigl[ \| X - m \|^2 \bigr] . \]

If $\mathbb{V}(\mu)$ is small (resp. large), then we are relatively certain (resp. uncertain)
that $X \sim \mu$ is in fact quite close to $m$. A more refined variance-based measure
of informativeness is the covariance operator

\[ C(\mu) := \mathbb{E}_{X \sim \mu} \bigl[ (X - m) \otimes (X - m) \bigr] . \]


A distribution $\mu$ for which the operator norm of $C(\mu)$ is large may be said to
be a relatively uninformative distribution. Note that when $\mathcal{X} = \mathbb{R}^n$, $C(\mu)$ is
an $n \times n$ symmetric positive-semi-definite matrix. Hence, such a $C(\mu)$ has $n$
non-negative real eigenvalues (counted with multiplicity)

\[ \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n \geq 0 \]

with corresponding normalized eigenvectors $v_1, \dots, v_n \in \mathbb{R}^n$. The direction
$v_1$ corresponding to the largest eigenvalue $\lambda_1$ is the direction in which the
uncertainty about the random vector $X$ is greatest; correspondingly, the
direction $v_n$ is the direction in which the uncertainty about the random vector
$X$ is least.
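
For $n = 2$ the eigenvalues and eigenvectors of a covariance matrix have a closed form, which makes the directions of greatest and least uncertainty easy to compute by hand or in a few lines of code. The sketch below is illustrative only: the helper name `eig_sym_2x2` and the example matrix are choices made here.

```python
import math


def eig_sym_2x2(cxx, cxy, cyy):
    """Eigenpairs (lam1, v1), (lam2, v2) of [[cxx, cxy], [cxy, cyy]] with lam1 >= lam2."""
    half_trace = 0.5 * (cxx + cyy)
    radius = math.hypot(0.5 * (cxx - cyy), cxy)
    lam1, lam2 = half_trace + radius, half_trace - radius
    if cxy != 0.0:
        v1 = (cxy, lam1 - cxx)      # solves (C - lam1 I) v = 0
    elif cxx >= cyy:
        v1 = (1.0, 0.0)             # diagonal matrix, first axis dominates
    else:
        v1 = (0.0, 1.0)             # diagonal matrix, second axis dominates
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])            # orthogonal unit vector
    return (lam1, v1), (lam2, v2)


# Example covariance matrix [[2, 1], [1, 2]] (hypothetical values).
(lam1, v1), (lam2, v2) = eig_sym_2x2(2.0, 1.0, 2.0)
print("greatest uncertainty:", lam1, "along", v1)
print("least uncertainty:   ", lam2, "along", v2)
```

Here the uncertainty is greatest along the diagonal direction $(1, 1)/\sqrt{2}$ and least along the orthogonal direction.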

A beautiful and classical result concerning the variance of two quantities of
interest is the uncertainty principle from quantum mechanics. In this setting,
the probability distribution is written as $p = |\psi|^2$, where $\psi$ is a unit-norm
element of a suitable Hilbert space, usually one such as $L^2(\mathbb{R}^n; \mathbb{C})$. Physical
observables like position, momentum, etc. act as self-adjoint operators on this
Hilbert space; e.g. the position operator $Q$ is

\[ (Q \psi)(x) := x \psi(x) , \]

so that the expected position is

\[ \langle \psi, Q \psi \rangle = \int_{\mathbb{R}^n} x \, |\psi(x)|^2 \, \mathrm{d}x . \]
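In general, for a fixed unit-norm element $\psi \in \mathcal{H}$, the expected value $\langle A \rangle$ and
variance $\mathbb{V}(A) \equiv \sigma_A^2$ of a self-adjoint operator $A \colon \mathcal{H} \to \mathcal{H}$ are defined by

\begin{align*}
\langle A \rangle &:= \langle \psi, A \psi \rangle , \\
\sigma_A^2 &:= \langle (A - \langle A \rangle)^2 \rangle .
\end{align*}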

The following inequality provides a fundamental lower bound on the product 
of the variances of any two observables A and B in terms of their commutator 
[A, B] := AB — BA and their anti-commutator {A, B} := AB + BA. When 
this lower bound is positive, the two variances cannot both be close to zero, 
so simultaneous high-precision measurements of A and B are impossible. 

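Theorem 5.1 (Uncertainty principle: Schrödinger’s inequality). Let $A$, $B$ be
self-adjoint operators on a Hilbert space $\mathcal{H}$, and let $\psi \in \mathcal{H}$ have unit norm.
Then

\[ \sigma_A^2 \sigma_B^2 \geq \left| \frac{\langle \{ A, B \} \rangle - 2 \langle A \rangle \langle B \rangle}{2} \right|^2 + \left| \frac{\langle [A, B] \rangle}{2} \right|^2 \tag{5.1} \]

and, a fortiori, $\sigma_A \sigma_B \geq \frac{1}{2} \bigl| \langle [A, B] \rangle \bigr|$.
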

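Proof. Let $f := (A - \langle A \rangle) \psi$ and $g := (B - \langle B \rangle) \psi$, so that

\[ \sigma_A^2 = \langle f, f \rangle = \| f \|^2 , \qquad \sigma_B^2 = \langle g, g \rangle = \| g \|^2 . \]

Therefore, by the Cauchy-Schwarz inequality (3.1),

\[ \sigma_A^2 \sigma_B^2 = \| f \|^2 \| g \|^2 \geq | \langle f, g \rangle |^2 . \]

Now write the right-hand side of this inequality as

\begin{align*}
| \langle f, g \rangle |^2 &= \bigl( \operatorname{Re} \langle f, g \rangle \bigr)^2 + \bigl( \operatorname{Im} \langle f, g \rangle \bigr)^2 \\
&= \left( \frac{\langle f, g \rangle + \langle g, f \rangle}{2} \right)^2 + \left( \frac{\langle f, g \rangle - \langle g, f \rangle}{2 i} \right)^2 .
\end{align*}

Using the self-adjointness of $A$ and $B$,

\begin{align*}
\langle f, g \rangle &= \langle (A - \langle A \rangle) \psi, (B - \langle B \rangle) \psi \rangle \\
&= \langle A B \rangle - \langle A \rangle \langle B \rangle - \langle A \rangle \langle B \rangle + \langle A \rangle \langle B \rangle \\
&= \langle A B \rangle - \langle A \rangle \langle B \rangle ;
\end{align*}

similarly, $\langle g, f \rangle = \langle B A \rangle - \langle A \rangle \langle B \rangle$. Hence,

\begin{align*}
\langle f, g \rangle - \langle g, f \rangle &= \langle [A, B] \rangle , \\
\langle f, g \rangle + \langle g, f \rangle &= \langle \{ A, B \} \rangle - 2 \langle A \rangle \langle B \rangle ,
\end{align*}

which yields (5.1). □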
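
Schrödinger’s inequality can be checked numerically in the simplest non-trivial setting, $\mathcal{H} = \mathbb{C}^2$. The sketch below takes $A$ and $B$ to be the Pauli $x$ and $y$ matrices and a particular unit vector $\psi$; these choices, the helper names, and the convention that the inner product is conjugate-linear in its first argument are all assumptions made for illustration.

```python
def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


def matmul(M, N):
    """Matrix-matrix product for nested lists."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N))) for j in range(len(N[0]))]
            for i in range(len(M))]


def inner(u, v):
    """Inner product <u, v>, conjugate-linear in the first argument."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def expect(M, psi):
    """Expected value <M> = <psi, M psi>."""
    return inner(psi, matvec(M, psi))


def variance(M, psi):
    """Variance <(M - <M>)^2> = <M^2> - <M>^2 of a self-adjoint M."""
    mean = expect(M, psi).real
    return expect(matmul(M, M), psi).real - mean ** 2


A = [[0, 1], [1, 0]]          # Pauli x
B = [[0, -1j], [1j, 0]]       # Pauli y
psi = [0.6, 0.8j]             # unit norm: 0.36 + 0.64 = 1

AB, BA = matmul(A, B), matmul(B, A)
commutator = [[AB[i][j] - BA[i][j] for j in range(2)] for i in range(2)]
anticommutator = [[AB[i][j] + BA[i][j] for j in range(2)] for i in range(2)]

lhs = variance(A, psi) * variance(B, psi)
rhs = (abs((expect(anticommutator, psi) - 2 * expect(A, psi) * expect(B, psi)) / 2) ** 2
       + abs(expect(commutator, psi) / 2) ** 2)
print(lhs, rhs)
```

For this particular state the two sides agree up to rounding error, so the bound (5.1) is attained.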
An alternative measure of information content, not based on variances, is
the information-theoretic notion of entropy: 

Information and Entropy. In information theory as pioneered by Claude
Shannon, the information (or surprisal) associated with a possible outcome
$x$ of a random variable $X \sim \mu$ taking values in a finite set $\mathcal{X}$ is defined to be

\[ I(x) := - \log \mathbb{P}_{X \sim \mu} [X = x] = - \log \mu(x) . \tag{5.2} \]

Information has units according to the base of the logarithm used:

base 2: bits; base e: nats/nits; base 10: bans/dits/hartleys.

The negative sign in (5.2) makes $I(x)$ non-negative, and logarithms are used
because one seeks a quantity $I(\,\cdot\,)$ that represents in an additive way the
‘surprise value’ of observing $x$. For example, if $x$ has half the probability of $y$,
then one is ‘twice as surprised’ to see the outcome $X = x$ instead of $X = y$,
and so $I(x) = I(y) + \log 2$. The entropy of the measure $\mu$ is the expected
information:

\[ H(\mu) := \mathbb{E}_{X \sim \mu} [I(X)] = - \sum_{x \in \mathcal{X}} \mu(x) \log \mu(x) . \tag{5.3} \]

(We follow the convention that $0 \log 0 := \lim_{p \to 0} p \log p = 0$.) These
definitions are readily extended to a random variable $X$ taking values in $\mathbb{R}^n$ and
distributed according to a probability measure $\mu$ that has Lebesgue density $\rho$:

\begin{align*}
I(x) &:= - \log \rho(x) , \\
H(\mu) &:= - \int_{\mathbb{R}^n} \rho(x) \log \rho(x) \, \mathrm{d}x .
\end{align*}

Since entropy measures the average information content of the possible values
of $X \sim \mu$, entropy is often interpreted as a measure of the uncertainty
implicit in $\mu$. (Remember that if $\mu$ is very ‘spread out’ and describes a lot of
uncertainty about $X$, then observing a particular value of $X$ carries a lot of
‘surprise value’ and hence a lot of information.)
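
For a measure on a finite set, (5.2) and (5.3) are short computations. The following sketch uses function names and two example distributions (a fair die and a loaded die) that are chosen here for illustration.

```python
import math


def surprisal(p, base=math.e):
    """Information -log(p) of an outcome with probability p > 0."""
    return -math.log(p) / math.log(base)


def entropy(probs, base=math.e):
    """Entropy -sum p log p of a discrete distribution, with 0 log 0 := 0."""
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(base)


fair_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]   # hypothetical loaded die

# Halving the probability of an outcome adds log 2, i.e. one bit, of information.
print(surprisal(0.5, base=2), surprisal(0.25, base=2))
# The more concentrated distribution has the smaller entropy.
print(entropy(fair_die, base=2), entropy(loaded_die, base=2))
```

The fair die has entropy $\log_2 6$ bits, the largest possible on six outcomes; the loaded die, being more predictable, has less.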

Example 5.2. Consider a Bernoulli random variable $X$ taking values in
$x_1, x_2 \in \mathcal{X}$ with probabilities $p, 1 - p \in [0, 1]$ respectively. This random
variable has entropy

\[ - p \log p - (1 - p) \log (1 - p) . \]

If $X$ is certain to equal $x_1$, then $p = 1$, and the entropy is 0; similarly, if $X$ is
certain to equal $x_2$, then $p = 0$, and the entropy is again 0; these two
distributions carry zero information and have minimal entropy. On the other hand,
if $p = \frac{1}{2}$, in which case $X$ is uniformly distributed on $\mathcal{X}$, then the entropy is
$\log 2$; indeed, this is the maximum possible entropy for a Bernoulli random
variable. This example is often interpreted as saying that when interrogating
someone with questions that demand “yes” or “no” answers, one gains
maximum information by asking questions that have an equal probability of
being answered “yes” versus “no”.

Proposition 5.3. Let $\mu$ and $\nu$ be probability measures on discrete sets or
$\mathbb{R}^n$. Then the product measure $\mu \otimes \nu$ satisfies

\[ H(\mu \otimes \nu) = H(\mu) + H(\nu) . \]

That is, the entropy of a random vector with independent components is the 
sum of the entropies of the component random variables. 

Proof. Exercise 5.1. □ 
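
Although the proof is left as an exercise, the identity of Proposition 5.3 is easy to check numerically for discrete measures. In the sketch below the two distributions are hypothetical and the product measure is stored as a flat list of products of probabilities.

```python
import math


def entropy(probs):
    """Entropy -sum p log p in nats, with 0 log 0 := 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def product_measure(mu, nu):
    """Probabilities of the product measure on pairs (i, j), flattened to a list."""
    return [p * q for p in mu for q in nu]


mu = [0.2, 0.8]            # hypothetical example distributions
nu = [0.1, 0.3, 0.6]

joint_entropy = entropy(product_measure(mu, nu))
sum_of_entropies = entropy(mu) + entropy(nu)
print(joint_entropy, sum_of_entropies)
```

This is a sanity check on one example, not a substitute for the proof.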
The definition of entropy in (5.3) implicitly uses a uniform measure (counting
measure on a finite set, or Lebesgue measure on $\mathbb{R}^n$) as a reference measure.
Upon reflection, there is no need to privilege uniform measure with being
the unique reference measure; indeed, in some settings, such as
infinite-dimensional Banach spaces, there is no such thing as a uniform measure
(cf. Theorem 2.38). In general, if $\mu$ is a probability measure on a measurable
space $(\mathcal{X}, \mathscr{F})$ with reference measure $\pi$, then we would like to define the
entropy of $\mu$ with respect to $\pi$ by an expression like

\[ H(\mu | \pi) = - \int_{\mathcal{X}} \frac{\mathrm{d}\mu}{\mathrm{d}\pi}(x) \log \frac{\mathrm{d}\mu}{\mathrm{d}\pi}(x) \, \mathrm{d}\pi(x) \]

whenever $\mu$ has a Radon-Nikodym derivative with respect to $\pi$. The negative
of this functional is a distance-like function on the set of probability measures
on $(\mathcal{X}, \mathscr{F})$:

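Definition 5.4. Let $\mu$, $\nu$ be $\sigma$-finite measures on $(\mathcal{X}, \mathscr{F})$. The
Kullback-Leibler divergence from $\mu$ to $\nu$ is defined to be

\[ D_{\mathrm{KL}}(\mu \| \nu) := \int_{\mathcal{X}} \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \, \mathrm{d}\nu \equiv \int_{\mathcal{X}} \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \, \mathrm{d}\mu \]

if $\mu \ll \nu$ and this integral is finite, and $+\infty$ otherwise.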
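
For measures on a finite set, Definition 5.4 reduces to a finite sum, and the absolute continuity condition $\mu \ll \nu$ says that $\nu$ charges every point that $\mu$ charges. The sketch below is illustrative; the function name and the two example distributions are choices made here.

```python
import math


def kl_divergence(mu, nu):
    """D_KL(mu || nu) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for p, q in zip(mu, nu):
        if p == 0:
            continue            # convention 0 log 0 := 0
        if q == 0:
            return math.inf     # mu is not absolutely continuous w.r.t. nu
        total += p * math.log(p / q)
    return total


mu = [0.5, 0.5, 0.0]
nu = [0.25, 0.25, 0.5]

print(kl_divergence(mu, nu))    # finite: nu charges every point that mu does
print(kl_divergence(nu, mu))    # infinite: nu charges a point that mu does not
```

The two values differ, one being $\log 2$ and the other $+\infty$, which makes the lack of symmetry of the Kullback-Leibler divergence concrete.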

While $D_{\mathrm{KL}}(\,\cdot\, \| \,\cdot\,)$ is non-negative, and vanishes if and only if its arguments
are identical (see Exercise 5.3), it is neither symmetric nor does it satisfy
the triangle inequality. Nevertheless, it can be used to define a topology on
$\mathcal{M}_+(\mathcal{X})$ or $\mathcal{M}_1(\mathcal{X})$ by taking as a basis of open sets for the topology the
‘balls’

\[ U(\mu, \varepsilon) := \{ \nu \mid D_{\mathrm{KL}}(\mu \| \nu) < \varepsilon \} \]

for arbitrary $\mu$ and $\varepsilon > 0$. The following result and Exercise 5.6 show that
$D_{\mathrm{KL}}(\,\cdot\, \| \,\cdot\,)$ generates a topology on $\mathcal{M}_1(\mathcal{X})$ that is strictly finer/stronger than
that generated by the total variation distance (2.4):

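Theorem 5.5 (Pinsker, 1964). For any $\mu, \nu \in \mathcal{M}_1(\mathcal{X}, \mathscr{F})$,

\[ d_{\mathrm{TV}}(\mu, \nu) \leq \sqrt{2 D_{\mathrm{KL}}(\mu \| \nu)} . \]

Proof. Consider a Hahn decomposition (Theorem 2.24) of $(\mathcal{X}, \mathscr{F})$ with
respect to the signed measure $\mu - \nu$: let $A_0$ and $A_1$ be disjoint measurable
sets with union $\mathcal{X}$ such that every measurable subset of $A_0$ (resp. $A_1$) has
non-negative (resp. non-positive) measure under $\mu - \nu$. Let $\mathcal{A} := \{ A_0, A_1 \}$.
Then the induced measures $\mu_{\mathcal{A}}$ and $\nu_{\mathcal{A}}$ on $\{ 0, 1 \}$ satisfy

\begin{align*}
d_{\mathrm{TV}}(\mu, \nu) &= \| \mu - \nu \|_{\mathrm{TV}} \\
&= (\mu - \nu)(A_0) - (\mu - \nu)(A_1) \\
&= \bigl( \mu_{\mathcal{A}}(0) - \nu_{\mathcal{A}}(0) \bigr) - \bigl( \mu_{\mathcal{A}}(1) - \nu_{\mathcal{A}}(1) \bigr) \\
&= d_{\mathrm{TV}}(\mu_{\mathcal{A}}, \nu_{\mathcal{A}}) .
\end{align*}

By the partition inequality (Exercise 5.5), $D_{\mathrm{KL}}(\mu \| \nu) \geq D_{\mathrm{KL}}(\mu_{\mathcal{A}} \| \nu_{\mathcal{A}})$, so it
suffices to prove Pinsker’s inequality in the case that $\mathcal{X}$ has only two elements
and $\mathscr{F}$ is the power set of $\mathcal{X}$.

To that end, let $\mathcal{X} := \{ 0, 1 \}$, and let

\begin{align*}
\mu &= p \delta_0 + (1 - p) \delta_1 , \\
\nu &= q \delta_0 + (1 - q) \delta_1 .
\end{align*}

Consider, for fixed $c \in \mathbb{R}$ and $p \in [0, 1]$,

\[ g(q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} - 4 c (p - q)^2 . \]

Note that $g(p) = 0$ and that, for $q \in (0, 1)$,

\begin{align*}
\frac{\partial g}{\partial q}(q) &= - \frac{p}{q} + \frac{1 - p}{1 - q} + 8 c (p - q) \\
&= (q - p) \left( \frac{1}{q (1 - q)} - 8 c \right) .
\end{align*}

Since, for all $q \in [0, 1]$, $0 \leq q (1 - q) \leq \frac{1}{4}$, it follows that for any $c \leq \frac{1}{2}$, $g(q)$
attains its minimum at $q = p$. Thus, for $c \leq \frac{1}{2}$,

\begin{align*}
g(q) &= D_{\mathrm{KL}}(\mu \| \nu) - c \bigl( |p - q| + |(1 - p) - (1 - q)| \bigr)^2 \\
&= D_{\mathrm{KL}}(\mu \| \nu) - c \bigl( d_{\mathrm{TV}}(\mu, \nu) \bigr)^2 \\
&\geq 0 .
\end{align*}

Setting $c = \frac{1}{2}$ yields Pinsker’s inequality. □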
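
Pinsker’s inequality can be spot-checked on randomly generated discrete distributions. The sketch below uses the normalization of this chapter, in which $d_{\mathrm{TV}}(\mu, \nu) = \sum_i |\mu_i - \nu_i|$ for discrete measures; the sampling scheme and all names are assumptions made for illustration.

```python
import math
import random


def total_variation(mu, nu):
    """Total variation distance in the normalization with diameter at most 2."""
    return sum(abs(p - q) for p, q in zip(mu, nu))


def kl_divergence(mu, nu):
    """D_KL(mu || nu) for strictly positive discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(mu, nu) if p > 0)


def random_distribution(k, rng):
    """A random strictly positive probability vector of length k."""
    weights = [rng.random() + 0.01 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(1)
worst_ratio = 0.0
for _ in range(1000):
    mu = random_distribution(5, rng)
    nu = random_distribution(5, rng)
    bound = math.sqrt(2.0 * max(kl_divergence(mu, nu), 0.0))
    if bound > 0.0:
        worst_ratio = max(worst_ratio, total_variation(mu, nu) / bound)

# Pinsker's inequality says this ratio can never exceed 1.
print(worst_ratio)
```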
One practical use of information-theoretic quantities such as the Kullback-
Leibler divergence is to design experiments that will, if run, yield a maximal 
gain in the Shannon information about the system of interest: 

Example 5.6 (Bayesian experimental design). Suppose that a Bayesian
point of view is adopted, and for simplicity that all the random variables
of interest are finite-dimensional with Lebesgue densities $p(\,\cdot\,)$. Consider the
problem of selecting an optimal experimental design $\lambda$ for the inference of
some parameters/unknowns $\theta$ from the observed data $y$ that will result from
the experiment $\lambda$. If, for each $\lambda$ and $\theta$, we know the conditional distribution
$y | \lambda, \theta$ of the observed data $y$ given $\lambda$ and $\theta$, then the conditional distribution
$y | \lambda$ is obtained by integration with respect to the prior distribution of $\theta$:

\[ p(y | \lambda) = \int p(y | \lambda, \theta) p(\theta) \, \mathrm{d}\theta . \]

Let $U(y, \lambda)$ be a real-valued measure of the utility of the posterior distribution

\[ p(\theta | y, \lambda) = \frac{p(y | \theta, \lambda) p(\theta)}{p(y | \lambda)} . \]

For example, one could take the utility function $U(y, \lambda)$ to be the Kullback-
Leibler divergence $D_{\mathrm{KL}} \bigl( p(\,\cdot\, | y, \lambda) \big\| p(\,\cdot\, | \lambda) \bigr)$ between the prior and posterior
distributions on $\theta$. An experimental design $\lambda$ that maximizes

\[ U(\lambda) := \int U(y, \lambda) p(y | \lambda) \, \mathrm{d}y \]

is one that is optimal in the sense of maximizing the expected gain in Shannon
information.

In general, the optimization problem of finding a maximally informative 
experimental design is highly non-trivial, especially in the case of compu- 
tationally intensive likelihood functions. See, e.g., Chaloner and Verdinelli 
(1995) for a survey of this large field of study. 
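
When the parameter, the data and the set of designs are all finite, the integrals in Example 5.6 become sums and the expected utility can be computed exactly. The toy problem below is entirely hypothetical: $\theta \in \{0, 1\}$ has a uniform prior, the datum $y \in \{0, 1\}$ reports $\theta$ but is flipped with probability `eps`, and the "design" is the choice between two sensors with different flip probabilities.

```python
import math


def likelihood(y, theta, eps):
    """p(y | design, theta): the sensor reports theta, flipped with probability eps."""
    return 1.0 - eps if y == theta else eps


def expected_information_gain(eps, prior):
    """Expected Kullback-Leibler divergence from prior to posterior for one design."""
    total = 0.0
    for y in (0, 1):
        # Evidence p(y | design), summing out theta against the prior.
        evidence = sum(likelihood(y, th, eps) * prior[th] for th in (0, 1))
        if evidence == 0.0:
            continue
        posterior = [likelihood(y, th, eps) * prior[th] / evidence for th in (0, 1)]
        utility = sum(post * math.log(post / prior[th])
                      for th, post in enumerate(posterior) if post > 0)
        total += utility * evidence
    return total


prior = [0.5, 0.5]
designs = {"precise sensor": 0.1, "noisy sensor": 0.4}   # hypothetical flip probabilities
gains = {name: expected_information_gain(eps, prior) for name, eps in designs.items()}
best = max(gains, key=gains.get)
print(gains, best)
```

As one would expect, the sensor with the smaller flip probability yields the larger expected information gain. Realistic design problems replace these sums by expensive integrals, which is what makes the optimization hard.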


Divergences and Other Distances. The total variation distance and 
Kullback-Leibler divergence are special cases of a more general class of 
distance-like functions between pairs of probability measures: 

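Definition 5.7. Let $\mu$ and $\nu$ be $\sigma$-finite measures on a common measurable
space $(\mathcal{X}, \mathscr{F})$, and let $f \colon [0, \infty] \to \mathbb{R} \cup \{ +\infty \}$ be any convex function such
that $f(1) = 0$. The $f$-divergence from $\mu$ to $\nu$ is defined to be

\[ D_f(\mu \| \nu) := \begin{cases} \displaystyle \int_{\mathcal{X}} f \left( \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \right) \mathrm{d}\nu , & \text{if } \mu \ll \nu , \\ +\infty , & \text{otherwise.} \end{cases} \tag{5.4} \]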
Equivalently, in terms of any reference measure $\rho$ with respect to which both
$\mu$ and $\nu$ are absolutely continuous (such as $\mu + \nu$),

\[ D_f(\mu \| \nu) := \int_{\mathcal{X}} f \left( \frac{\mathrm{d}\mu}{\mathrm{d}\rho} \bigg/ \frac{\mathrm{d}\nu}{\mathrm{d}\rho} \right) \frac{\mathrm{d}\nu}{\mathrm{d}\rho} \, \mathrm{d}\rho . \tag{5.5} \]

It is good practice to check that the alternative definition (5.5) is, in fact,
independent of the reference measure used:

Lemma 5.8. Suppose that $\mu$ and $\nu$ are absolutely continuous with respect to
both $\rho_1$ and $\rho_2$. Then $\rho_1$ and $\rho_2$ are equivalent measures except for sets of
$(\mu + \nu)$-measure zero, and (5.5) defines the same value with $\rho = \rho_1$ as it does
with $\rho = \rho_2$.

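Proof. Suppose that $\rho_1$ and $\rho_2$ are inequivalent. Then, without loss of
generality, there exists a measurable set $E$ such that $\rho_1(E) = 0$ but $\rho_2(E) > 0$.
Therefore, since $\mu \ll \rho_1$ and $\nu \ll \rho_1$, it follows that $\mu(E) = \nu(E) = 0$. Thus,
although $\rho_1$ and $\rho_2$ may be inequivalent for arbitrary measurable sets, they
are equivalent for sets of positive $(\mu + \nu)$-measure.

Now let $E$ be a set of full measure under $\nu$, so that $\frac{\mathrm{d}\rho_2}{\mathrm{d}\rho_1}$ exists and is nowhere
zero in $E$. Then the chain rule for Radon-Nikodym derivatives (Theorem
2.30) yields

\begin{align*}
\int_{\mathcal{X}} f \left( \frac{\mathrm{d}\mu}{\mathrm{d}\rho_1} \bigg/ \frac{\mathrm{d}\nu}{\mathrm{d}\rho_1} \right) \frac{\mathrm{d}\nu}{\mathrm{d}\rho_1} \, \mathrm{d}\rho_1
&= \int_{E} f \left( \frac{\mathrm{d}\mu}{\mathrm{d}\rho_1} \bigg/ \frac{\mathrm{d}\nu}{\mathrm{d}\rho_1} \right) \mathrm{d}\nu && \text{since } \nu(\mathcal{X} \setminus E) = 0 \\
&= \int_{E} f \left( \frac{\mathrm{d}\mu}{\mathrm{d}\rho_2} \frac{\mathrm{d}\rho_2}{\mathrm{d}\rho_1} \bigg/ \frac{\mathrm{d}\nu}{\mathrm{d}\rho_2} \frac{\mathrm{d}\rho_2}{\mathrm{d}\rho_1} \right) \mathrm{d}\nu && \text{by Theorem 2.30} \\
&= \int_{E} f \left( \frac{\mathrm{d}\mu}{\mathrm{d}\rho_2} \bigg/ \frac{\mathrm{d}\nu}{\mathrm{d}\rho_2} \right) \mathrm{d}\nu \\
&= \int_{\mathcal{X}} f \left( \frac{\mathrm{d}\mu}{\mathrm{d}\rho_2} \bigg/ \frac{\mathrm{d}\nu}{\mathrm{d}\rho_2} \right) \frac{\mathrm{d}\nu}{\mathrm{d}\rho_2} \, \mathrm{d}\rho_2 .
\end{align*}

□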


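Jensen’s inequality and the conditions on $f$ immediately imply that
$f$-divergences of probability measures are non-negative:

\[ D_f(\mu \| \nu) = \int_{\mathcal{X}} f \left( \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \right) \mathrm{d}\nu \geq f \left( \int_{\mathcal{X}} \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \, \mathrm{d}\nu \right) = f(1) = 0 . \]

For strictly convex $f$, equality holds if and only if $\mu = \nu$, and for the Kullback-
Leibler distance this is known as Gibbs’ inequality (Exercise 5.3).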

Example 5.9. (a) The total variation distance defined in (2.4) is the
$f$-divergence with $f(t) := |t - 1|$; this can be seen most directly from
formulation (5.5). As already discussed, $d_{\mathrm{TV}}$ is a metric on the space of
probability measures on $(\mathcal{X}, \mathscr{F})$, and indeed it is a norm on the space of
signed measures on $(\mathcal{X}, \mathscr{F})$. Under the total variation distance, $\mathcal{M}_1(\mathcal{X})$
has diameter at most 2.

(b) The Kullback-Leibler divergence is the $f$-divergence with $f(t) := t \log t$.
This does not define a metric, since in general it is neither symmetric nor
does it satisfy the triangle inequality.

(c) The Hellinger distance is the square root of the $f$-divergence with
$f(t) := | \sqrt{t} - 1 |^2$, i.e.

\[ d_{\mathrm{H}}(\mu, \nu)^2 := \int_{\mathcal{X}} \left| \sqrt{\frac{\mathrm{d}\mu}{\mathrm{d}\nu}} - 1 \right|^2 \mathrm{d}\nu = \int_{\mathcal{X}} \left| \sqrt{\frac{\mathrm{d}\mu}{\mathrm{d}\rho}} - \sqrt{\frac{\mathrm{d}\nu}{\mathrm{d}\rho}} \right|^2 \mathrm{d}\rho \]

for any reference measure $\rho$, and is a bona fide metric.
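
For discrete measures with $\mu \ll \nu$, formula (5.4) is a weighted sum, and the three divergences of Example 5.9 differ only in the choice of $f$. The following sketch implements (5.4) as stated; the function names and the example distributions are illustrative choices, and both examples are strictly positive so that absolute continuity is not an issue.

```python
import math


def f_divergence(f, mu, nu):
    """D_f(mu || nu) = sum_i f(mu_i / nu_i) nu_i if mu << nu, and +inf otherwise."""
    total = 0.0
    for p, q in zip(mu, nu):
        if q == 0:
            if p > 0:
                return math.inf   # mu is not absolutely continuous w.r.t. nu
            continue
        total += f(p / q) * q
    return total


def f_tv(t):
    return abs(t - 1.0)


def f_kl(t):
    return t * math.log(t) if t > 0 else 0.0


def f_hellinger(t):
    return (math.sqrt(t) - 1.0) ** 2


mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]

d_tv = f_divergence(f_tv, mu, nu)
d_kl = f_divergence(f_kl, mu, nu)
d_h = math.sqrt(f_divergence(f_hellinger, mu, nu))
print(d_tv, d_kl, d_h)
```

Swapping the arguments leaves `d_tv` and `d_h` unchanged but in general changes `d_kl`, in line with parts (a) to (c) of the example.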


The total variation and Kullback-Leibler distances and their associated 
topologies are related by Pinsker’s inequality (Theorem 5.5); the correspond- 
ing result for the total variation and Hellinger distances and their topologies 
is Kraft’s inequality (see Steerneman (1983) for generalizations to signed and 
product measures): 

Theorem 5.10 (Kraft, 1955). Let $\mu$, $\nu$ be probability measures on $(\Theta, \mathscr{F})$.
Then

\[ d_{\mathrm{H}}(\mu, \nu)^2 \leq d_{\mathrm{TV}}(\mu, \nu) \leq 2 d_{\mathrm{H}}(\mu, \nu) . \tag{5.6} \]

Hence, the total variation metric and Hellinger metric induce the same
topology on $\mathcal{M}_1(\Theta)$.

Remark 5.11. It also is common in the literature to see the total variation
distance defined as the $f$-divergence with $f(t) := \frac{1}{2} |t - 1|$ and the Hellinger
distance defined as the square root of the $f$-divergence with $f(t) := \frac{1}{2} | \sqrt{t} - 1 |^2$.
In this case, Kraft’s inequality (5.6) becomes

\[ d_{\mathrm{H}}(\mu, \nu)^2 \leq d_{\mathrm{TV}}(\mu, \nu) \leq \sqrt{2} \, d_{\mathrm{H}}(\mu, \nu) . \tag{5.7} \]
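
Both normalizations of Kraft’s inequality can be spot-checked on random discrete distributions. In the sketch below, the distances in the convention of (5.6) are computed first and then rescaled to the convention of Remark 5.11; the sampling scheme, the tolerance and the names are assumptions made for illustration.

```python
import math
import random


def total_variation(mu, nu):
    """d_TV in the convention f(t) = |t - 1|."""
    return sum(abs(p - q) for p, q in zip(mu, nu))


def hellinger(mu, nu):
    """d_H in the convention f(t) = |sqrt(t) - 1|^2."""
    return math.sqrt(sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(mu, nu)))


def random_distribution(k, rng):
    """A random strictly positive probability vector of length k."""
    weights = [rng.random() + 0.01 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(2)
tol = 1e-12                      # allowance for floating-point rounding
holds_56 = True
holds_57 = True
for _ in range(1000):
    mu = random_distribution(4, rng)
    nu = random_distribution(4, rng)
    d_tv, d_h = total_variation(mu, nu), hellinger(mu, nu)
    holds_56 = holds_56 and d_h ** 2 <= d_tv + tol and d_tv <= 2.0 * d_h + tol
    # Rescale to the convention of Remark 5.11.
    d_tv_half, d_h_half = 0.5 * d_tv, d_h / math.sqrt(2.0)
    holds_57 = (holds_57 and d_h_half ** 2 <= d_tv_half + tol
                and d_tv_half <= math.sqrt(2.0) * d_h_half + tol)

print(holds_56, holds_57)
```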


A useful property of the Hellinger distance is that it provides a Lipschitz- 
continuous bound on how the expectation of a random variable changes when 
changing measure from one measure to another. This property will be useful 
in the results of Chapter 6 on the well-posedness of Bayesian inverse problems. 

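Proposition 5.12. Let $(\mathcal{V}, \| \cdot \|)$ be a Banach space, and suppose that $f \colon \mathcal{X} \to \mathcal{V}$
has finite second moment with respect to $\mu, \nu \in \mathcal{M}_1(\mathcal{X})$. Then

\[ \bigl\| \mathbb{E}_\mu [f] - \mathbb{E}_\nu [f] \bigr\| \leq 2 \sqrt{\mathbb{E}_\mu \bigl[ \| f \|^2 \bigr] + \mathbb{E}_\nu \bigl[ \| f \|^2 \bigr]} \, d_{\mathrm{H}}(\mu, \nu) . \]

Proof. Exercise 5.7. □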

There are also useful measures of distance between probability measures 
that make use of the metric space structure of the sample space, if it has 
one. The following metric, the Lévy-Prokhorov distance, is particularly
important in analysis because it corresponds to the often-used topology of weak
convergence of probability measures: 

Definition 5.13. The Lévy-Prokhorov distance between probability
measures $\mu$ and $\nu$ on a metric space $(\mathcal{X}, d)$ is defined by

\[ d_{\mathrm{LP}}(\mu, \nu) := \inf \left\{ \varepsilon > 0 \,\middle|\, \begin{array}{l} \mu(A) \leq \nu(A^\varepsilon) + \varepsilon \text{ and} \\ \nu(A) \leq \mu(A^\varepsilon) + \varepsilon \text{ for all measurable } A \subseteq \mathcal{X} \end{array} \right\} , \]

where $A^\varepsilon$ denotes the open $\varepsilon$-neighbourhood of $A$ in the metric $d$, i.e.

\[ A^\varepsilon := \bigcup_{a \in A} B_\varepsilon(a) = \{ x \in \mathcal{X} \mid d(a, x) < \varepsilon \text{ for some } a \in A \} . \]

It can be shown that this defines a metric on the space of probability
measures on $\mathcal{X}$. The Lévy-Prokhorov metric $d_{\mathrm{LP}}$ on $\mathcal{M}_1(\mathcal{X})$ inherits many
of the properties of the original metric $d$ on $\mathcal{X}$: if $(\mathcal{X}, d)$ is separable, then so
too is $(\mathcal{M}_1(\mathcal{X}), d_{\mathrm{LP}})$; and if $(\mathcal{X}, d)$ is complete, then so too is $(\mathcal{M}_1(\mathcal{X}), d_{\mathrm{LP}})$.
By (h) below, the Lévy-Prokhorov metric metrizes the topology of weak
convergence of probability measures, which by (d) below is essentially the
topology of convergence of bounded and continuous statistics:

Theorem 5.14 (Portmanteau theorem for weak convergence). Let $(\mu_n)_{n \in \mathbb{N}}$
be a sequence of probability measures on a topological space $\mathcal{X}$, and let
$\mu \in \mathcal{M}_1(\mathcal{X})$. Then the following are equivalent, and determine the weak
convergence of $\mu_n$ to $\mu$:

(a) $\limsup_{n \to \infty} \mu_n(F) \leq \mu(F)$ for all closed $F \subseteq \mathcal{X}$;

(b) $\liminf_{n \to \infty} \mu_n(U) \geq \mu(U)$ for all open $U \subseteq \mathcal{X}$;

(c) $\lim_{n \to \infty} \mu_n(A) = \mu(A)$ for all $A \subseteq \mathcal{X}$ with $\mu(\partial A) = 0$;

(d) $\lim_{n \to \infty} \mathbb{E}_{\mu_n}[f] = \mathbb{E}_\mu[f]$ for every bounded and continuous $f \colon \mathcal{X} \to \mathbb{R}$;

(e) $\lim_{n \to \infty} \mathbb{E}_{\mu_n}[f] = \mathbb{E}_\mu[f]$ for every bounded and Lipschitz $f \colon \mathcal{X} \to \mathbb{R}$;

(f) $\limsup_{n \to \infty} \mathbb{E}_{\mu_n}[f] \leq \mathbb{E}_\mu[f]$ for every $f \colon \mathcal{X} \to \mathbb{R}$ that is upper
semi-continuous and bounded above;

(g) $\liminf_{n \to \infty} \mathbb{E}_{\mu_n}[f] \geq \mathbb{E}_\mu[f]$ for every $f \colon \mathcal{X} \to \mathbb{R}$ that is lower
semi-continuous and bounded below;

(h) when $\mathcal{X}$ is metrized by a metric $d$, $\lim_{n \to \infty} d_{\mathrm{LP}}(\mu_n, \mu) = 0$.

Some further examples of distances between probability measures are
included in the exercises at the end of the chapter, and the bibliography gives
references for more comprehensive surveys.
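
The difference between weak convergence and convergence in total variation is already visible for Dirac measures on $\mathbb{R}$. The sketch below takes $\mu_n = \delta_{1/n}$ and $\mu = \delta_0$. It uses two facts that are easy to verify from the definitions: $d_{\mathrm{TV}}(\delta_a, \delta_b) = 2$ whenever $a \neq b$ (in the normalization of this chapter), and $d_{\mathrm{LP}}(\delta_a, \delta_b) = \min \{ |a - b|, 1 \}$. The helper names and the test function are choices made for illustration.

```python
import math


def tv_dirac(a, b):
    """Total variation distance between Dirac measures at a and b (diameter-2 convention)."""
    return 0.0 if a == b else 2.0


def lp_dirac(a, b):
    """Levy-Prokhorov distance between Dirac measures at a and b on the real line."""
    return min(abs(a - b), 1.0)


f = math.atan    # a bounded, continuous (indeed Lipschitz) test function

for n in (1, 10, 100, 1000):
    a = 1.0 / n
    # E_{delta_a}[f] = f(a), so the gap in expectations is |f(a) - f(0)|.
    print(n, abs(f(a) - f(0.0)), lp_dirac(a, 0.0), tv_dirac(a, 0.0))

# Condition (c) genuinely needs mu(boundary of A) = 0: for A = (-inf, 0],
# the boundary {0} has full delta_0-mass, and mu_n(A) = 0 for every n
# although mu(A) = 1.
mass_of_A = [1.0 if 1.0 / n <= 0.0 else 0.0 for n in (1, 10, 100, 1000)]
print(mass_of_A)
```

The expectation gap and the Lévy-Prokhorov distance both tend to zero, in agreement with (d) and (h) of Theorem 5.14, while the total variation distance stays equal to 2.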


5.5 Bibliography

The Kullback-Leibler divergence was introduced by Kullback and Leibler [...]
[...] for a comprehensive review of this large field.
[...]drov (1940, 1941, 1943). Theorem 5.14, the portmanteau theorem for weak
convergence, can be found in many texts on probability theory, e.g. that of
Billingsley (1995, Section 2).

The Wasserstein metric (also known as the Kantorovich or Rubinstein 
metric, or earth-mover’s distance) of Exercise 5.11 plays a central role in 
the theory of optimal transportation; for comprehensive treatments of this 
topic, see the books of Villani (2003, 2009), and also Ambrosio et al. (2008, 
Chapter 6). Gibbs and Su (2002) give a short self-contained survey of many 
distances between probability measures, and the relationships among them. 
Deza and Deza (2014, Chapter 14) give a more extensive treatment of 
distances between probability measures, in the context of a wide-ranging 
discussion of distances of all kinds. 


5.6 Exercises 

Exercise 5.1. Prove Proposition 5.3. That is, suppose that $\mu$ and $\nu$ are
probability measures on discrete sets or $\mathbb{R}^n$, and show that the product measure
$\mu \otimes \nu$ satisfies

\[ H(\mu \otimes \nu) = H(\mu) + H(\nu) . \]

That is, the entropy of a random vector with independent components is the
sum of the entropies of the component random variables.

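Exercise 5.2. Let $\mu_0 = \mathcal{N}(m_0, C_0)$ and $\mu_1 = \mathcal{N}(m_1, C_1)$ be non-degenerate
Gaussian measures on $\mathbb{R}^n$. Show that

\[ D_{\mathrm{KL}}(\mu_0 \| \mu_1) = \frac{1}{2} \left( \log \frac{\det C_1}{\det C_0} - n + \operatorname{tr} (C_1^{-1} C_0) + (m_0 - m_1) \cdot C_1^{-1} (m_0 - m_1) \right) . \]

Hint: use the fact that, when $X \sim \mathcal{N}(m, C)$ is an $\mathbb{R}^n$-valued Gaussian random
vector and $A \in \mathbb{R}^{n \times n}$ is symmetric,

\[ \mathbb{E}[X \cdot A X] = \operatorname{tr} (A C) + m \cdot A m . \]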


Exercise 5.3. Let $\mu$ and $\nu$ be probability measures on a measurable space
$(\mathcal{X}, \mathscr{F})$. Prove Gibbs’ inequality that $D_{\mathrm{KL}}(\mu \| \nu) \geq 0$, with equality if and only
if $\mu = \nu$.

Exercise 5.4. Let $f$ satisfy the requirements for $D_f(\,\cdot\, \| \,\cdot\,)$ to be a divergence.

(a) Show that the function $(x, y) \mapsto y f(x / y)$ is a convex function from
$(0, \infty) \times (0, \infty)$ to $\mathbb{R} \cup \{ +\infty \}$.

(b) Hence show that $D_f(\,\cdot\, \| \,\cdot\,)$ is jointly convex in its two arguments, i.e. for
all probability measures $\mu_0$, $\mu_1$, $\nu_0$, and $\nu_1$ and $t \in [0, 1]$,

\[ D_f \bigl( (1 - t) \mu_0 + t \mu_1 \big\| (1 - t) \nu_0 + t \nu_1 \bigr) \leq (1 - t) D_f(\mu_0 \| \nu_0) + t D_f(\mu_1 \| \nu_1) . \]


Exercise 5.5. The following result is a useful one that frequently allows
statements about $f$-divergences to be reduced to the case of a finite or
countable sample space. Let $(\mathcal{X}, \mathscr{F}, \mu)$ be a probability space, and let
$f \colon [0, \infty] \to [0, \infty]$ be convex. Given a partition $\mathcal{A} = \{ A_n \mid n \in \mathbb{N} \}$ of $\mathcal{X}$ into countably
many pairwise disjoint measurable sets, define a probability measure $\mu_{\mathcal{A}}$ on
$\mathbb{N}$ by $\mu_{\mathcal{A}}(n) := \mu(A_n)$.

(a) Suppose that $\mu(A_n) > 0$ and that $\mu \ll \nu$. Show that, for each $n \in \mathbb{N}$,

\[ \frac{1}{\nu(A_n)} \int_{A_n} f \left( \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \right) \mathrm{d}\nu \geq f \left( \frac{\mu(A_n)}{\nu(A_n)} \right) . \]

(b) Hence prove the following result, known as the partition inequality: for
any two probability measures $\mu$ and $\nu$ on $\mathcal{X}$ with $\mu \ll \nu$,

\[ D_f(\mu \| \nu) \geq D_f(\mu_{\mathcal{A}} \| \nu_{\mathcal{A}}) . \]

Show also that, for strictly convex $f$, equality holds if and only if $\mu(A_n) =
\nu(A_n)$ for each $n$.


Exercise 5.6. Show that Pinsker’s inequality (Theorem 5.5) cannot be
reversed. In particular, give an example of a measurable space $(\mathcal{X}, \mathscr{F})$ such
that, for any $\varepsilon > 0$, there exist probability measures $\mu$ and $\nu$ on $(\mathcal{X}, \mathscr{F})$ with
$d_{\mathrm{TV}}(\mu, \nu) \leq \varepsilon$ but $D_{\mathrm{KL}}(\mu \| \nu) = +\infty$. Hint: consider a ‘small’ perturbation to
the CDF of a probability measure on $\mathbb{R}$.

Exercise 5.7. Prove Proposition 5.12. That is, let $(\mathcal{V}, \| \cdot \|)$ be a Banach
space, and suppose that $f \colon \mathcal{X} \to \mathcal{V}$ has finite second moment with respect to
$\mu, \nu \in \mathcal{M}_1(\mathcal{X})$. Then

\[ \bigl\| \mathbb{E}_\mu [f] - \mathbb{E}_\nu [f] \bigr\| \leq 2 \sqrt{\mathbb{E}_\mu \bigl[ \| f \|^2 \bigr] + \mathbb{E}_\nu \bigl[ \| f \|^2 \bigr]} \, d_{\mathrm{H}}(\mu, \nu) . \]


Exercise 5.8. Suppose that $\mu$ and $\nu$ are equivalent probability measures on
$(\mathcal{X}, \mathscr{F})$ and define

\[ d(\mu, \nu) := \operatorname*{ess\,sup}_{x \in \mathcal{X}} \left| \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu}(x) \right| . \]

(See Example 2.7 for the definition of the essential supremum.) Show that this
defines a well-defined metric on the measure equivalence class $\mathcal{E}$ containing
$\mu$ and $\nu$. In particular, show that neither the choice of function used as the
Radon-Nikodym derivative $\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$, nor the choice of measure in $\mathcal{E}$ with respect
to which the essential supremum is taken, affects the value of $d(\mu, \nu)$.

Exercise 5.9. For a probability measure $\mu$ on $\mathbb{R}$, let $F_\mu \colon \mathbb{R} \to [0, 1]$ be the
cumulative distribution function (CDF) defined by

\[ F_\mu(x) := \mu((-\infty, x]) . \]

Show that the Lévy-Prokhorov distance between probability measures $\mu$,
$\nu \in \mathcal{M}_1(\mathbb{R})$ reduces to the Lévy distance, defined in terms of their CDFs $F_\mu$,
$F_\nu$ by

\[ d_{\mathrm{L}}(\mu, \nu) := \inf \bigl\{ \varepsilon > 0 \bigm| F_\mu(x - \varepsilon) - \varepsilon \leq F_\nu(x) \leq F_\mu(x + \varepsilon) + \varepsilon \text{ for all } x \in \mathbb{R} \bigr\} . \]

Convince yourself that this distance can be visualized as the side length of the
largest square with sides parallel to the coordinate axes that can be placed
between the graphs of $F_\mu$ and $F_\nu$.

Exercise 5.10. Let $(\mathcal{X}, d)$ be a metric space, equipped with its Borel
$\sigma$-algebra. The Łukaszyk-Karmowski distance between probability measures
$\mu$ and $\nu$ is defined by

\[ d_{\mathrm{ŁK}}(\mu, \nu) := \int_{\mathcal{X}} \int_{\mathcal{X}} d(x, x') \, \mathrm{d}\mu(x) \, \mathrm{d}\nu(x') . \]

Show that this satisfies all the requirements to be a metric on the space of
probability measures on $\mathcal{X}$ except for the requirement that $d_{\mathrm{ŁK}}(\mu, \mu) = 0$.
Hint: suppose that $\mu = \mathcal{N}(m, \sigma^2)$ on $\mathbb{R}$, and show that $d_{\mathrm{ŁK}}(\mu, \mu) = \frac{2 \sigma}{\sqrt{\pi}}$.
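
The value in the hint can be checked by simulation before attempting the calculation: $d_{\mathrm{ŁK}}(\mu, \mu)$ is the expected distance between two independent draws from $\mu$. The sketch below uses hypothetical values $m = 1$ and $\sigma = 2$ and a plain Monte Carlo average.

```python
import math
import random

rng = random.Random(3)
m, sigma = 1.0, 2.0      # hypothetical mean and standard deviation
num_samples = 100_000

# Monte Carlo estimate of E|X - X'| for independent X, X' ~ N(m, sigma^2).
estimate = sum(abs(rng.gauss(m, sigma) - rng.gauss(m, sigma))
               for _ in range(num_samples)) / num_samples
exact = 2.0 * sigma / math.sqrt(math.pi)
print(estimate, exact)
```

The estimate is strictly positive and close to $2 \sigma / \sqrt{\pi}$, so the self-distance does not vanish.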

Exercise 5.11. Let $(\mathcal{X}, d)$ be a metric space, equipped with its Borel
$\sigma$-algebra. The Wasserstein distance between probability measures $\mu$ and
$\nu$ is defined by

\[ d_{\mathrm{W}}(\mu, \nu) := \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, x') \, \mathrm{d}\gamma(x, x') , \]

where the infimum is taken over the set $\Gamma(\mu, \nu)$ of all measures $\gamma$ on $\mathcal{X} \times \mathcal{X}$
such that the push-forward of $\gamma$ onto the first (resp. second) copy of $\mathcal{X}$ is $\mu$
(resp. $\nu$). Show that this defines a metric on the space of probability measures
on $\mathcal{X}$, bounded above by the Łukaszyk-Karmowski distance, i.e.

\[ d_{\mathrm{W}}(\mu, \nu) \leq d_{\mathrm{ŁK}}(\mu, \nu) . \]

Verify also that the $p$-Wasserstein distance

\[ d_{\mathrm{W}, p}(\mu, \nu) := \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, x')^p \, \mathrm{d}\gamma(x, x') \right)^{1/p} , \]

where $p \geq 1$, is a metric. Metrics of this type, and in particular the case $p = 1$,
are sometimes known as the earth-mover’s distance or optimal transportation
distance, since the minimization over $\gamma \in \Gamma(\mu, \nu)$ can be seen as finding the
optimal way of moving/rearranging the pile of earth $\mu$ into the pile $\nu$.
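
On the real line with $d(x, x') = |x - x'|$, the optimal coupling between two empirical measures with equally many, equally weighted points is the monotone one that matches sorted samples, so the Wasserstein distance reduces to a sort. The sketch below compares it with the Łukaszyk-Karmowski distance, which corresponds to the independent (product) coupling; the sample points and function names are hypothetical.

```python
def wasserstein1_empirical(xs, ys):
    """W_1 between uniform empirical measures on equally many real points."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many points")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def lukaszyk_karmowski_empirical(xs, ys):
    """Average distance under the independent (product) coupling."""
    return sum(abs(x - y) for x in xs for y in ys) / (len(xs) * len(ys))


xs = [0.0, 1.0, 3.0]
ys = [0.5, 2.0, 2.5]

d_w = wasserstein1_empirical(xs, ys)
d_lk = lukaszyk_karmowski_empirical(xs, ys)
print(d_w, d_lk)
print(wasserstein1_empirical(xs, xs), lukaszyk_karmowski_empirical(xs, xs))
```

The product coupling is just one admissible $\gamma$, which is why the Wasserstein distance can never exceed the Łukaszyk-Karmowski distance; the last line also shows that only the former vanishes when the two measures coincide.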


Chapter 6 

Bayesian Inverse Problems 


It ain’t what you don’t know that gets you 
into trouble. It’s what you know for sure that 
just ain’t so. 


Mark Twain 


This chapter provides a general introduction, at the high level, to the
backward propagation of uncertainty/information in the solution of inverse
problems, and specifically a Bayesian probabilistic perspective on such inverse
problems. Under the umbrella of inverse problems, we consider parameter 
estimation and regression. One specific aim is to make clear the connection 
between regularization and the application of a Bayesian prior. The filtering 
methods of Chapter 7 fall under the general umbrella of Bayesian approaches 
to inverse problems, but have an additional emphasis on real-time computa- 
tional expediency. 

Many modern UQ applications involve inverse problems where the unknown
to be inferred is an element of some infinite-dimensional function space, e.g.
inference problems involving PDEs with uncertain coefficients. Naturally,
such problems can be discretized, and the inference problem solved on the
finite-dimensional space, but this is not always a well-behaved procedure:
similar issues arise in Bayesian inversion on function spaces as arise in the
numerical analysis of PDEs. For example, there are ‘stable’ and ‘unstable’
ways to discretize a PDE (e.g. the Courant-Friedrichs-Lewy condition), and
analogously there are ‘stable’ and ‘unstable’ ways to discretize a Bayesian
inverse problem. Sometimes, a discretized PDE problem has a solution, but the
original continuum problem does not (e.g. the backward heat equation, or the
control problem for the wave equation), and this phenomenon can be seen in
the ill-conditioning of the discretized problem as the discretization dimension
tends to infinity; similar problems can afflict a discretized Bayesian inverse
problem. Therefore, one aim of this chapter is to present an elementary
well-posedness theory for Bayesian inversion on the function space, so that this
well-posedness will automatically be inherited by any finite-dimensional
discretization. For a thorough treatment of all these questions, see the sources
cited in the bibliography.


6.1 Inverse Problems and Regularization 

Many mathematical models, and UQ problems, are forward problems, i.e. we
are given some input $u$ for a mathematical model $H$, and are required to
determine the corresponding output $y$ given by

\[ y = H(u), \tag{6.1} \]

where $\mathcal{U}$, $\mathcal{Y}$ are, say, Banach spaces, $u \in \mathcal{U}$, $y \in \mathcal{Y}$, and
$H \colon \mathcal{U} \to \mathcal{Y}$ is the observation operator. However, many applications
require the solution of the inverse problem: we are given $y$ and $H$ and must
determine $u$ such that (6.1) holds. Inverse problems are typically ill-posed:
there may be no solution, the solution may not be unique, or there may be a
unique solution that depends sensitively on $y$. Indeed, very often we do not
actually observe $H(u)$, but some noisily corrupted version of it, such as

\[ y = H(u) + \eta. \tag{6.2} \]

The inverse problem framework encompasses the problem of model
calibration (or parameter estimation), where a model $H_\theta$ relating inputs to
outputs depends upon some parameters $\theta \in \Theta$, e.g., when
$\mathcal{U} = \mathcal{Y} = \Theta$, $H_\theta(u) = \theta u$. The problem is, given some observations of
inputs $u_i$ and corresponding outputs $y_i$, to find the parameter value $\theta$
such that

\[ y_i = H_\theta(u_i) \quad \text{for each } i. \]

Again, this problem is typically ill-posed. 

One approach to the problem of ill-posedness is to seek a least-squares
solution: find, for the norm $\| \cdot \|_{\mathcal{Y}}$ on $\mathcal{Y}$,

\[ \operatorname*{arg\,min}_{u \in \mathcal{U}} \| y - H(u) \|_{\mathcal{Y}}^2 . \]

However, this problem, too, can be difficult to solve as it may possess
minimizing sequences that do not have a limit in $\mathcal{U}$,¹ or may possess multiple
minima, or may depend sensitively on the observed data $y$. Especially in this
last case, it may be advantageous to not try to fit the observed data too
closely, and instead regularize the problem by seeking

\[ \operatorname*{arg\,min} \left\{ \| y - H(u) \|_{\mathcal{Y}}^2 + \| u - \bar{u} \|_{\mathcal{V}}^2 \,\middle|\, u \in \mathcal{V} \subseteq \mathcal{U} \right\} \]

for some Banach space $\mathcal{V}$ embedded in $\mathcal{U}$ and a chosen $\bar{u} \in \mathcal{V}$. The standard
example of this regularization setup is Tikhonov regularization, as in Theorem
4.28: when $\mathcal{U}$ and $\mathcal{Y}$ are Hilbert spaces, given a compact, positive, self-adjoint
operator $R$ on $\mathcal{U}$, we seek

\[ \operatorname*{arg\,min} \left\{ \| y - H(u) \|_{\mathcal{Y}}^2 + \| R^{-1/2} (u - \bar{u}) \|_{\mathcal{U}}^2 \,\middle|\, u \in \mathcal{U} \right\}. \]

The operator $R$ describes the structure of the regularization, which in some
sense is the practitioner’s ‘prior belief about what the solution should look
like’. More generally, since it might be desired to weight the various
components of $y$ differently from the given Hilbert norm on $\mathcal{Y}$, we might seek

\[ \operatorname*{arg\,min} \left\{ \| Q^{-1/2} (y - H(u)) \|_{\mathcal{Y}}^2 + \| R^{-1/2} (u - \bar{u}) \|_{\mathcal{U}}^2 \,\middle|\, u \in \mathcal{U} \right\} \]

for a given positive self-adjoint operator $Q$ on $\mathcal{Y}$. However, this whole
approach appears to be somewhat ad hoc, especially where the choice of
regularization is concerned.

1 Take a moment to reconcile the statement “there may exist minimizing sequences that
do not have a limit in $\mathcal{U}$” with $\mathcal{U}$ being a Banach space.
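The effect of regularization on a sensitive problem can be seen in a small finite-dimensional sketch. Everything specific here is an assumption made for illustration: the forward operator is the $8 \times 8$ Hilbert matrix (a standard ill-conditioned example), the norms are Euclidean, $\bar{u} = 0$, and $R = \alpha^{-1} I$, so that the Tikhonov penalty is $\alpha \| u - \bar{u} \|^2$.

```python
import numpy as np

n = 8
idx = np.arange(1, n + 1)
# Hilbert matrix: a classic ill-conditioned forward operator (illustrative choice)
A = 1.0 / (idx[:, None] + idx[None, :] - 1.0)

u_true = np.ones(n)
rng = np.random.default_rng(1)
y = A @ u_true + 1e-6 * rng.standard_normal(n)   # slightly noisy data

# Unregularized solution of A u = y (the least squares solution, A being square)
u_ls = np.linalg.solve(A, y)

# Tikhonov: minimize ||y - A u||^2 + alpha * ||u - u_bar||^2, i.e. R = I / alpha
alpha = 1e-8
u_bar = np.zeros(n)
u_tik = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ y + alpha * u_bar)

print("cond(A)           =", np.linalg.cond(A))
print("error, unreg.     =", np.linalg.norm(u_ls - u_true))
print("error, Tikhonov   =", np.linalg.norm(u_tik - u_true))
```

The regularized solve trades a small bias towards $\bar{u}$ for a large reduction in sensitivity to the noise; how to choose $\alpha$ (equivalently $R$) is precisely the ad hoc element that the Bayesian viewpoint below addresses.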

Taking a probabilistic (specifically, Bayesian) viewpoint alleviates
these difficulties. If we think of $u$ and $y$ as random variables, then (6.2)
defines the conditioned random variable $y | u$, and we define the ‘solution’ of
the inverse problem to be the conditioned random variable $u | y$. This allows
us to model the noise, $\eta$, via its statistical properties, even if we do not know
the exact instance of $\eta$ that corrupted the given data, and it also allows us
to specify a priori the form of solutions that we believe to be more likely,
thereby enabling us to attach weights to multiple solutions which explain the
data. This is the essence of the Bayesian approach to inverse problems.

Remark 6.1. In practice the true observation operator is often approximated
by some numerical model $H( \cdot ; h)$, where $h$ denotes a mesh parameter,
or parameter controlling missing physics. In this case (6.2) becomes

\[ y = H(u; h) + \varepsilon + \eta, \]

where $\varepsilon := H(u) - H(u; h)$. In principle, the observational noise $\eta$ and the
computational error $\varepsilon$ could be combined into a single term, but keeping them
separate is usually more appropriate: unlike $\eta$, $\varepsilon$ is typically not of mean zero,
and is dependent upon $u$.

To illustrate the central role that least squares minimization plays in
elementary statistical estimation, and hence motivate the more general
considerations of the rest of the chapter, consider the following finite-dimensional
linear problem. Suppose that we are interested in learning some vector of
parameters $u \in \mathbb{R}^n$, which gives rise to a vector $y \in \mathbb{R}^m$ of observations via

\[ y = Au + \eta, \]

where $A \in \mathbb{R}^{m \times n}$ is a known linear operator (matrix) and $\eta$ is a (not
necessarily Gaussian) noise vector known to have mean zero and symmetric,
positive-definite covariance matrix $Q := \mathbb{E}[\eta \otimes \eta] = \mathbb{E}[\eta \eta^*] \in \mathbb{R}^{m \times m}$, with $\eta$
independent of $u$. A common approach is to seek an estimate $\hat{u}$ of $u$ that is
a linear function $Ky$ of the data $y$, is unbiased in the sense that $\mathbb{E}[\hat{u}] = u$,
and is the best estimate in that it minimizes an appropriate cost function.
The following theorem, the Gauss-Markov theorem, states that there is
precisely one such estimator, and it is the solution to the weighted least squares
problem with weight $Q^{-1}$, i.e.

\[ \hat{u} = \operatorname*{arg\,min}_{u \in \mathbb{R}^n} J(u), \quad J(u) := \frac{1}{2} \| Au - y \|_{Q^{-1}}^2 . \]

In fact, this result holds true even in the setting of Hilbert spaces:

Theorem 6.2 (Gauss-Markov). Let $\mathcal{H}$ and $\mathcal{K}$ be separable Hilbert spaces, and
let $A \colon \mathcal{H} \to \mathcal{K}$. Let $u \in \mathcal{H}$ and let $y = Au + \eta$, where $\eta$ is a centred
$\mathcal{K}$-valued random variable with self-adjoint and positive definite covariance
operator $Q$. Suppose that $Q^{-1/2} A$ has closed range and that $A^* Q^{-1} A$ is
invertible. Then, among all unbiased linear estimators $K \colon \mathcal{K} \to \mathcal{H}$, producing
an estimate $\hat{u} = Ky$ of $u$ given $y$, the one that minimizes both the mean-squared
error $\mathbb{E}[\| \hat{u} - u \|^2]$ and the covariance operator² $\mathbb{E}[(\hat{u} - u) \otimes (\hat{u} - u)]$ is

\[ K = (A^* Q^{-1} A)^{-1} A^* Q^{-1}, \tag{6.3} \]

and the resulting estimate $\hat{u}$ has $\mathbb{E}[\hat{u}] = u$ and covariance operator

\[ \mathbb{E}[(\hat{u} - u) \otimes (\hat{u} - u)] = (A^* Q^{-1} A)^{-1}. \]

Remark 6.3. Indeed, by Theorem 4.28, $\hat{u} = (A^* Q^{-1} A)^{-1} A^* Q^{-1} y$ is also
the solution to the weighted least squares problem with weight $Q^{-1}$, i.e.

\[ \hat{u} = \operatorname*{arg\,min}_{u \in \mathcal{H}} J(u), \quad J(u) := \frac{1}{2} \| Au - y \|_{Q^{-1}}^2 . \]

Note that the first and second derivatives (gradient and Hessian) of $J$ are

\[ \nabla J(u) = A^* Q^{-1} A u - A^* Q^{-1} y, \quad \text{and} \quad \nabla^2 J(u) = A^* Q^{-1} A, \]

so the covariance of $\hat{u}$ is the inverse of the Hessian of $J$. These observations
will be useful in the construction of the Kalman filter in Chapter 7.
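The statements of Theorem 6.2 and Remark 6.3 can be checked numerically in finite dimensions. The sketch below assumes real matrices (so that adjoints are transposes) and uses a randomly generated $A$ and $Q$ purely for illustration. It verifies that $KA = I$ (unbiasedness), that $KQK^* = (A^* Q^{-1} A)^{-1}$, and that a competing unbiased linear estimator $L = K + D$ with $DA = 0$ has a covariance that exceeds $KQK^*$ by a positive semi-definite operator.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
A = rng.standard_normal((m, n))          # forward matrix (full column rank almost surely)
B = rng.standard_normal((m, m))
Q = B @ B.T + m * np.eye(m)              # symmetric positive-definite noise covariance

Qinv_A = np.linalg.solve(Q, A)           # Q^{-1} A
hessian = A.T @ Qinv_A                   # A* Q^{-1} A, the Hessian of J
K = np.linalg.solve(hessian, Qinv_A.T)   # (A* Q^{-1} A)^{-1} A* Q^{-1}, using Q = Q*
cov = np.linalg.inv(hessian)             # claimed covariance of the estimate

print("K A = I:        ", np.allclose(K @ A, np.eye(n)))
print("K Q K* = cov:   ", np.allclose(K @ Q @ K.T, cov))

# A competing unbiased linear estimator L = K + D, where D A = 0.
D = rng.standard_normal((n, m))
D = D - (D @ A) @ K                      # enforce D A = 0 (this uses K A = I)
L = K + D
excess = L @ Q @ L.T - K @ Q @ K.T       # equals D Q D*, up to rounding
min_eig = np.linalg.eigvalsh((excess + excess.T) / 2).min()
print("excess covariance is PSD:", min_eig >= -1e-10)
```

Solving with `np.linalg.solve` rather than forming $Q^{-1}$ explicitly is a numerical convenience only; the estimator computed is the one in (6.3).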


2 Here, the minimization is meant in the sense of positive semi-definite operators: for two
operators $A$ and $B$, we say that $A \leq B$ if $B - A$ is a positive semi-definite operator.
Proof of Theorem 6.2. It is easily verified that $K$ as given by (6.3) is an
unbiased estimator:

\[ \hat{u} = (A^* Q^{-1} A)^{-1} A^* Q^{-1} (Au + \eta) = u + (A^* Q^{-1} A)^{-1} A^* Q^{-1} \eta \]

and so, taking expectations of both sides and using the assumption that $\eta$ is
centred, $\mathbb{E}[\hat{u}] = u$. Moreover, the covariance of this estimator satisfies

\[ \mathbb{E}[(\hat{u} - u) \otimes (\hat{u} - u)] = K \mathbb{E}[\eta \otimes \eta] K^* = (A^* Q^{-1} A)^{-1}, \]

as claimed.

Now suppose that $L = K + D$ is any linear unbiased estimator. Unbiasedness
of $L$ means that $LAu = u$ for every $u$, i.e. $LA = I$; since also $KA = I$, it
follows that $DA = 0$. Then the covariance of the estimate $Ly$ satisfies

\begin{align*}
\mathbb{E}[(Ly - u) \otimes (Ly - u)] &= \mathbb{E}[(K + D) \eta \otimes \eta (K^* + D^*)] \\
&= (K + D) Q (K^* + D^*) \\
&= K Q K^* + D Q D^* + K Q D^* + (K Q D^*)^*.
\end{align*}

Since $DA = 0$,

\[ K Q D^* = (A^* Q^{-1} A)^{-1} A^* Q^{-1} Q D^* = (A^* Q^{-1} A)^{-1} (DA)^* = 0, \]

and so

\[ \mathbb{E}[(Ly - u) \otimes (Ly - u)] = K Q K^* + D Q D^*. \]

Since $D Q D^*$ is self-adjoint and positive semi-definite, this shows that

\[ \mathbb{E}[(Ly - u) \otimes (Ly - u)] \geq K Q K^*. \qquad \square \]

Remark 6.4. In the finite-dimensional case, if $A^* Q^{-1} A$ is not invertible,
then it is common to use the estimator

\[ K = (A^* Q^{-1} A)^{\dagger} A^* Q^{-1}, \]

where $B^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of $B$, defined
equivalently by

\begin{align*}
B^{\dagger} &:= \lim_{\delta \to 0} (B^* B + \delta I)^{-1} B^*, \\
B^{\dagger} &:= \lim_{\delta \to 0} B^* (B B^* + \delta I)^{-1}, \text{ or} \\
B^{\dagger} &:= V \Sigma^{\dagger} U^*,
\end{align*}

where $B = U \Sigma V^*$ is the singular value decomposition of $B$, and $\Sigma^{\dagger}$ is
the transpose of the matrix obtained from $\Sigma$ by replacing all the strictly
positive singular values by their reciprocals. In infinite-dimensional settings,
the use of regularization and pseudo-inverses is a more subtle topic, especially
when the noise $\eta$ has degenerate covariance operator $Q$.
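The three characterizations of the pseudo-inverse in Remark 6.4 can be compared on a small example. The sketch below assumes a real, rank-deficient matrix chosen by hand (rank 2, with non-zero singular values $4$ and $2\sqrt{2}$), approximates the two limits with a small but non-zero $\delta$, and compares all routes against NumPy's `numpy.linalg.pinv`.

```python
import numpy as np

# A 5x4 real matrix of rank 2, written as a sum of two rank-one terms (illustrative)
B = (np.outer([1.0, 2.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0])
     + np.outer([0.0, 1.0, 1.0, 0.0, 2.0], [0.0, 1.0, 0.0, -1.0]))

# SVD route: take reciprocals of the strictly positive singular values only
U, s, Vh = np.linalg.svd(B, full_matrices=False)
tol = max(B.shape) * np.finfo(float).eps * s.max()
s_dagger = np.array([1.0 / sv if sv > tol else 0.0 for sv in s])
B_dagger_svd = Vh.T @ np.diag(s_dagger) @ U.T

# Limit routes, evaluated at a small positive delta instead of taking the limit
delta = 1e-7
B_dagger_left = np.linalg.solve(B.T @ B + delta * np.eye(4), B.T)
# (B B* + delta I) is symmetric, so B*(B B* + delta I)^{-1} = ((B B* + delta I)^{-1} B)^T
B_dagger_right = np.linalg.solve(B @ B.T + delta * np.eye(5), B).T

reference = np.linalg.pinv(B)
print("SVD route matches pinv:    ", np.allclose(B_dagger_svd, reference, atol=1e-10))
print("left limit route, max dev: ", np.abs(B_dagger_left - reference).max())
print("right limit route, max dev:", np.abs(B_dagger_right - reference).max())
```

The limit formulas are only approximated here: taking $\delta$ much smaller makes the regularized matrices nearly singular in floating-point arithmetic, which is one practical reason to prefer the SVD route.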
Bayesian Interpretation of Regularization. The Gauss-Markov estimator
is not ideal: for example, because of its characterization as the minimizer of
a quadratic cost function, it is sensitive to large outliers in the data, i.e.
components of $y$ that differ greatly from the corresponding component of $A\hat{u}$. In
such a situation, it may be desirable to not try to fit the observed data $y$ too
closely, and instead regularize the problem by seeking $\hat{u}$, the minimizer of

\[ J(u) := \frac{1}{2} \| Au - y \|_{Q^{-1}}^2 + \frac{1}{2} \| u - \bar{u} \|_{R^{-1}}^2, \tag{6.4} \]

for some chosen $\bar{u} \in \mathbb{K}^n$ and positive-definite Tikhonov matrix $R \in \mathbb{K}^{n \times n}$.
Depending upon the relative sizes of $Q$ and $R$, $\hat{u}$ will be influenced more
by the data $y$ and hence be close to the Gauss-Markov estimator, or be
influenced more by the regularization term and hence be close to $\bar{u}$. At first
sight this procedure may seem somewhat ad hoc, but it has a natural Bayesian
interpretation.

Let us make the additional assumption that, not only is $\eta$ centred with
covariance operator $Q$, but it is in fact Gaussian. Then, to a Bayesian
practitioner, the observation equation

\[ y = Au + \eta \]

defines the conditional distribution $y | u$ as $(y - Au) | u = \eta \sim \mathcal{N}(0, Q)$. Finding
the minimizer of $u \mapsto \frac{1}{2} \| Au - y \|_{Q^{-1}}^2$, i.e. $\hat{u} = Ky$, amounts to finding the
maximum likelihood estimator of $u$ given $y$. The Bayesian interpretation of
the regularization term is that $\mathcal{N}(\bar{u}, R)$ is a prior distribution for $u$. The
resulting posterior distribution for $u | y$ has Lebesgue density $p(u | y)$ with

\begin{align*}
p(u | y) &\propto \exp\left( -\tfrac{1}{2} \| Au - y \|_{Q^{-1}}^2 \right) \exp\left( -\tfrac{1}{2} \| u - \bar{u} \|_{R^{-1}}^2 \right) \\
&= \exp\left( -\tfrac{1}{2} \| Au - y \|_{Q^{-1}}^2 - \tfrac{1}{2} \| u - \bar{u} \|_{R^{-1}}^2 \right) \\
&\propto \exp\left( -\tfrac{1}{2} \| u - Ky \|_{A^* Q^{-1} A}^2 - \tfrac{1}{2} \| u - \bar{u} \|_{R^{-1}}^2 \right) \\
&\propto \exp\left( -\tfrac{1}{2} \| u - P^{-1} (A^* Q^{-1} A K y + R^{-1} \bar{u}) \|_{P}^2 \right),
\end{align*}

where the last two steps discard factors that do not depend on $u$ and, by
Exercise 6.1, $P$ is the precision matrix

\[ P = A^* Q^{-1} A + R^{-1}. \]

The solution of the regularized least squares problem of minimizing the
functional $J$ in (6.4), i.e. minimizing the exponent in the above posterior
distribution, is the maximum a posteriori (MAP) estimator of $u$ given $y$. However,
the full posterior contains more information than the MAP estimator alone. In
particular, the posterior covariance matrix $P^{-1} = (A^* Q^{-1} A + R^{-1})^{-1}$ reveals
those components of $u$ about which we are relatively more or less certain.
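The correspondence between the regularized functional (6.4) and the Gaussian posterior can be checked directly in a small real example. The dimensions, the matrices $A$, $Q$, $R$, and the data below are illustrative assumptions; the point is only that the posterior mean $P^{-1}(A^* Q^{-1} y + R^{-1} \bar{u})$ is a stationary point (indeed the minimizer) of $J$, and that the posterior variances do not exceed the prior ones.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))
Q = 0.1 * np.eye(m)                      # noise covariance (assumed)
R = 2.0 * np.eye(n)                      # prior covariance (assumed)
u_bar = np.zeros(n)                      # prior mean (assumed)
y = A @ np.array([1.0, -0.5, 2.0]) + np.sqrt(0.1) * rng.standard_normal(m)

Qinv, Rinv = np.linalg.inv(Q), np.linalg.inv(R)
P = A.T @ Qinv @ A + Rinv                                    # posterior precision
u_map = np.linalg.solve(P, A.T @ Qinv @ y + Rinv @ u_bar)    # posterior mean = MAP
post_cov = np.linalg.inv(P)                                  # posterior covariance

def J(u):
    """Regularized least squares functional (6.4)."""
    r, d = A @ u - y, u - u_bar
    return 0.5 * r @ Qinv @ r + 0.5 * d @ Rinv @ d

grad_at_map = A.T @ Qinv @ (A @ u_map - y) + Rinv @ (u_map - u_bar)
print("gradient of J at the MAP point:", grad_at_map)
print("prior variances:    ", np.diag(R))
print("posterior variances:", np.diag(post_cov))
```

Comparing the diagonals of $R$ and $P^{-1}$ shows in which coordinate directions the data have been informative, which is the extra information that the MAP point alone does not carry.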
Non-Quadratic Regularization and Recovery of Sparse Signals. This
chapter mostly deals with the case in which both the noise model (i.e. the
likelihood) and the prior are Gaussian measures, which is the same as saying
that the maximum a posteriori estimator is obtained by minimizing the sum
of the squares of two Hilbert norms, just as in (6.4). However, there is no
fundamental reason not to consider other regularizations or, in Bayesian
terms, other priors. Indeed, in many cases an appropriate choice of prior is a
probability distribution with both a heavy centre and a heavy tail, such as


\[ \frac{\mathrm{d}\mu_0}{\mathrm{d}u}(u) \propto \exp\left( - \left( \sum_{i=1}^{n} |u_i|^p \right)^{1/p} \right) \]

on $\mathbb{R}^n$, for $0 < p < 1$. Such regularizations correspond to a prior belief that
the $u$ to be recovered from noisy observations $y$ is sparse, in the sense that it
has a simple low-dimensional structure, e.g. that most of its components in
some coordinate system are zero.

For definiteness, consider a finite-dimensional example in which it is
desired to recover $u \in \mathbb{K}^n$ from noisy observations $y \in \mathbb{K}^m$ of $Au$, where
$A \in \mathbb{K}^{m \times n}$ is known. Let

\[ \| u \|_0 := \# \{ i \in \{ 1, \dots, n \} \mid u_i \neq 0 \}. \]

(Note well that, despite the suggestive notation, $\| \cdot \|_0$ is not a norm, since
in general $\| \lambda u \|_0 \neq |\lambda| \| u \|_0$.) If the corruption of $Au$ into $y$ occurs through
additive Gaussian noise distributed according to $\mathcal{N}(0, Q)$, then the ordinary
least squares estimate of $u$ is found by minimizing $\frac{1}{2} \| Au - y \|_{Q^{-1}}^2$. However,
a prior belief that $u$ is sparse, i.e. that $\| u \|_0$ is small, is reflected in the
regularized least squares problem

\[ \text{find } u \in \mathbb{K}^n \text{ to minimize } J_0(u) := \frac{1}{2} \| Au - y \|_{Q^{-1}}^2 + \lambda \| u \|_0, \tag{6.5} \]

where $\lambda > 0$ is a regularization parameter. Unfortunately, problem (6.5) is
very difficult to solve numerically, since the objective function is not convex.
Instead, we consider

\[ \text{find } u \in \mathbb{K}^n \text{ to minimize } J_1(u) := \frac{1}{2} \| Au - y \|_{Q^{-1}}^2 + \lambda \| u \|_1. \tag{6.6} \]

Remarkably, the two optimization problems (6.5) and (6.6) are ‘often’
equivalent in the sense of having the same minimizers; this near-equivalence can
be made precise by a detailed probabilistic analysis using the so-called
restricted isometry property, which will not be covered here, and is foundational
in the field of compressed sensing. Regularization using the 1-norm amounts
to putting a Laplace distribution Bayesian prior on $u$, and is known in the
statistical regression literature as the LASSO (least absolute shrinkage and
selection operator); in the signal processing literature, it is known as basis
pursuit denoising.

For a heuristic understanding of why regularizing using the norm $\| \cdot \|_1$
promotes sparsity, let us consider an even more general problem: let
$R \colon \mathbb{K}^n \to \mathbb{R}$ be any convex function, and consider the problem

\[ \text{find } u \in \mathbb{K}^n \text{ to minimize } J_R(u) := \frac{1}{2} \| Au - y \|_{Q^{-1}}^2 + R(u), \tag{6.7} \]

which clearly includes (6.4) and (6.6) as special cases. Observe that, by
writing $r = R(u)$ for the value of the regularization term, we have

\[ \inf_{u \in \mathbb{K}^n} J_R(u) = \inf_{r \geq 0} \left( r + \inf_{u : R(u) = r} \frac{1}{2} \| Au - y \|_{Q^{-1}}^2 \right). \tag{6.8} \]

The equality constraint in (6.8) can in fact be relaxed to an inequality:

\[ \inf_{u \in \mathbb{K}^n} J_R(u) = \inf_{r \geq 0} \left( r + \inf_{u : R(u) \leq r} \frac{1}{2} \| Au - y \|_{Q^{-1}}^2 \right). \tag{6.9} \]

Note that convexity of $R$ implies that $\{ u \in \mathbb{K}^n \mid R(u) \leq r \}$ is a convex subset
of $\mathbb{K}^n$. The reason for the equivalence of (6.8) and (6.9) is quite simple: if
$(r, u) = (r^*, u^*)$ were minimal for the right-hand side and also $R(u^*) < r^*$,
then the right-hand side could be reduced by considering instead $(r, u) =
(R(u^*), u^*)$, which preserves the value of the quadratic term but decreases
the regularization term. This contradicts the optimality of $(r^*, u^*)$. Hence,
in (6.9), we may assume that the optimizer has $R(u^*) = r^*$, which is exactly
the earlier problem (6.8).

In the case that $R(u)$ is a multiple of the 1- or 2-norm of $u$, the region
$R(u) \leq r$ is a norm ball centred on the origin, and the above arguments
show that the minimizer $u^*$ of $J_1$ or $J_2$ will be a boundary point of that
ball. However, as indicated in Figure 6.1, in the 1-norm case, this $u^*$ will
‘typically’ lie on one of the low-dimensional faces of the 1-norm ball, and so
$\| u^* \|_0$ will be small and $u^*$ will be sparse. There are, of course, $y$ for which
$u^*$ is non-sparse, but this is the exception for 1-norm regularization, whereas
it is the rule for ordinary 2-norm (Tikhonov) regularization.
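The contrast between 1-norm and 2-norm regularization can be observed numerically. The sketch below is one possible illustration, not a method prescribed by the text: it assumes real data, $Q = I$, and solves (6.6) by the iterative soft-thresholding algorithm (ISTA, a proximal gradient method), which is swapped in here as a simple solver. The problem sizes, the sparse ground truth and the value of $\lambda$ are arbitrary choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal map of t * ||.||_1: shrink each component towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=5000):
    """Minimize 0.5 * ||A u - y||_2^2 + lam * ||u||_1 by proximal gradient steps."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant of the gradient
    u = np.zeros(A.shape[1])
    for _ in range(n_iter):
        u = soft_threshold(u - step * (A.T @ (A @ u - y)), step * lam)
    return u

rng = np.random.default_rng(0)
m, n = 30, 60                                   # fewer observations than unknowns
A = rng.standard_normal((m, n)) / np.sqrt(m)
u_true = np.zeros(n)
u_true[[3, 17, 42]] = [2.0, -1.5, 1.0]          # sparse signal (assumed)
y = A @ u_true + 0.01 * rng.standard_normal(m)

lam = 0.05
u_l1 = ista(A, y, lam)                                        # problem (6.6) with Q = I
u_l2 = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)    # (6.4) with Q = I, R = I / lam

print("non-zero components, 1-norm:", np.count_nonzero(np.abs(u_l1) > 1e-8))
print("non-zero components, 2-norm:", np.count_nonzero(np.abs(u_l2) > 1e-8))
```

When $A = I$ the minimizer of (6.6) is exactly `soft_threshold(y, lam)`, which makes the mechanism visible: components of the data smaller than $\lambda$ in magnitude are set to zero, whereas the quadratic penalty merely rescales them.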


6.2 Bayesian Inversion in Banach Spaces 

This section concerns Bayesian inversion in Banach spaces, and, in particular,
establishing the appropriate rigorous statement of Bayes’ rule in settings
where, by Theorem 2.38, there is no Lebesgue measure with respect
to which we can take densities. Therefore, in such settings, it is necessary
to use as the prior a measure such as a Gaussian or Besov measure, often
accessed through a sampling scheme such as a Karhunen-Loève expansion, as
in Section 11.1. Note, however, that when the observation operator $H$ is
non-linear, although the prior may be a ‘simple’ Gaussian measure, the posterior
will in general be a non-Gaussian measure with features such as multiple
modes of different widths. Thus, the posterior is an object much richer in
information than a simple maximum likelihood or maximum a posteriori
estimator obtained from the optimization-theoretic point of view.

[Figure 6.1, two panels: (a) Quadratic ($\ell^2$) regularization. (b) Sparse ($\ell^1$) regularization.]

Fig. 6.1: Comparison of $\ell^2$ versus $\ell^1$ regularization of a least squares
minimization problem. The shaded region indicates a norm ball centred
on the origin for the appropriate regularizing norm. The black ellipses,
centred on the unregularized least squares (Gauss-Markov) solution
$Ky = (A^* Q^{-1} A)^{-1} A^* Q^{-1} y$, are contours of the original objective function,
$u \mapsto \| Au - y \|_{Q^{-1}}^2$. By (6.9), the regularized solution $u^*$ lies on the intersection of
an objective function contour and the boundary of the regularization norm
ball; for the 1-norm, $u^*$ is sparse for ‘most’ $y$.

Example 6.5. There are many applications in which it is of interest to
determine the permeability of subsurface rock, e.g. the prediction of transport of
radioactive waste from an underground waste repository, or the optimization
of oil recovery from underground fields. The flow velocity $v$ of a fluid under
pressure $p$ in a medium of permeability $\kappa$ is given by Darcy’s law

\[ v = -\kappa \nabla p. \]

The pressure field $p$ within a bounded, open domain $\mathcal{X} \subset \mathbb{R}^d$ is governed by
the elliptic PDE

\[ -\nabla \cdot (\kappa \nabla p) = 0 \quad \text{in } \mathcal{X}, \]

together with some boundary conditions, e.g. the Neumann (zero flux)
boundary condition $\nabla p \cdot \hat{n}_{\partial \mathcal{X}} = 0$ on $\partial \mathcal{X}$; one can also consider a non-zero source
term $f$ on the right-hand side. For simplicity, take the permeability tensor
field $\kappa$ to be a scalar field $k$ times the identity tensor; for mathematical and
physical reasons, it is important that $k$ be positive, so write $k = e^u$. The
objective is to recover $u$ from, say, observations of the pressure field at known
points $x_1, \dots, x_m \in \mathcal{X}$:

\[ y_i = p(x_i) + \eta_i. \]

Note that this fits the general ‘$y = H(u) + \eta$’ setup, with $H$ being defined
implicitly by the solution operator to the elliptic boundary value problem.

In general, let $u$ be a random variable with (prior) distribution $\mu_0$, which
we do not at this stage assume to be Gaussian, on a separable Banach space
$\mathcal{U}$. Suppose that we observe data $y \in \mathbb{R}^m$ according to (6.2), where $\eta$ is an
$\mathbb{R}^m$-valued random variable independent of $u$ with probability density $\rho$ with
respect to Lebesgue measure. Let $\Phi(u; y)$ be any function that differs from
$-\log \rho(y - H(u))$ by an additive function of $y$ alone, so that

\[ \frac{\rho(y - H(u))}{\rho(y)} \propto \exp(-\Phi(u; y)) \]

with a constant of proportionality independent of $u$. An informal application
of Bayes’ rule suggests that the posterior probability distribution of $u$ given
$y$, $\mu^y \equiv \mu_0( \cdot | y)$, has Radon-Nikodym derivative with respect to the prior,
$\mu_0$, given by

\[ \frac{\mathrm{d}\mu^y}{\mathrm{d}\mu_0}(u) \propto \exp(-\Phi(u; y)). \]


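The next theorem makes this argument rigorous:

Theorem 6.6 (Generalized Bayes’ rule). Suppose that $H \colon \mathcal{U} \to \mathbb{R}^m$ is
continuous, and that $\eta$ is absolutely continuous with support $\mathbb{R}^m$. If $u \sim \mu_0$, then
$u | y \sim \mu^y$, where $\mu^y \ll \mu_0$ and

\[ \frac{\mathrm{d}\mu^y}{\mathrm{d}\mu_0}(u) \propto \exp(-\Phi(u; y)). \tag{6.10} \]

The proof of Theorem 6.6 uses the following technical lemma:

Lemma 6.7 (Dudley, 2002, Section 10.2). Let $\mu$, $\nu$ be probability measures
on $\mathcal{U} \times \mathcal{Y}$, where $(\mathcal{U}, \mathscr{A})$ and $(\mathcal{Y}, \mathscr{B})$ are measurable spaces. Assume that
$\mu \ll \nu$ and that $\frac{\mathrm{d}\mu}{\mathrm{d}\nu} = \varphi$, and that the conditional distribution of $u | y$ under
$\nu$, denoted by $\nu^y(\mathrm{d}u)$, exists. Then the distribution of $u | y$ under $\mu$, denoted
$\mu^y(\mathrm{d}u)$, exists and $\mu^y \ll \nu^y$, with Radon-Nikodym derivative given by

\[ \frac{\mathrm{d}\mu^y}{\mathrm{d}\nu^y}(u) = \begin{cases} \dfrac{\varphi(u, y)}{Z(y)}, & \text{if } Z(y) > 0, \\ 1, & \text{otherwise,} \end{cases} \]

where $Z(y) := \int_{\mathcal{U}} \varphi(u, y) \, \mathrm{d}\nu^y(u)$.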
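Proof of Theorem 6.6. Let $\mathbb{Q}_0(\mathrm{d}y) := \rho(y) \, \mathrm{d}y$ on $\mathbb{R}^m$ and
$\mathbb{Q}(\mathrm{d}y | u) := \rho(y - H(u)) \, \mathrm{d}y$, so that, by construction,

\[ \frac{\mathrm{d}\mathbb{Q}}{\mathrm{d}\mathbb{Q}_0}(y | u) = C(y) \exp(-\Phi(u; y)). \]

Define measures $\nu_0$ and $\nu$ on $\mathbb{R}^m \times \mathcal{U}$ by

\begin{align*}
\nu_0(\mathrm{d}y, \mathrm{d}u) &:= \mathbb{Q}_0(\mathrm{d}y) \otimes \mu_0(\mathrm{d}u), \\
\nu(\mathrm{d}y, \mathrm{d}u) &:= \mathbb{Q}(\mathrm{d}y | u) \mu_0(\mathrm{d}u).
\end{align*}

Note that $\nu_0$ is a product measure under which $u$ and $y$ are independent,
whereas $\nu$ is not. Since $H$ is continuous, so is $\Phi$; since $\mu_0(\mathcal{U}) = 1$, it follows
that $\Phi$ is $\mu_0$-measurable. Therefore, $\nu$ is well defined, $\nu \ll \nu_0$, and

\[ \frac{\mathrm{d}\nu}{\mathrm{d}\nu_0}(y, u) = C(y) \exp(-\Phi(u; y)). \]

Note that

\[ \int_{\mathcal{U}} \exp(-\Phi(u; y)) \, \mathrm{d}\mu_0(u) = C(y) \int_{\mathcal{U}} \rho(y - H(u)) \, \mathrm{d}\mu_0(u) > 0, \]

since $\rho$ is strictly positive on $\mathbb{R}^m$ and $H$ is continuous. Since $\nu_0(\mathrm{d}u | y) =
\mu_0(\mathrm{d}u)$, the result follows from Lemma 6.7. □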

Exercise 6.2 shows that, if the prior $\mu_0$ is a Gaussian measure and the
potential $\Phi$ is quadratic in $u$, then, for all $y$, the posterior $\mu^y$ is Gaussian.
In particular, if the observation operator is a continuous linear map and the
observations are corrupted by additive Gaussian noise, then the posterior is
Gaussian (see Exercise 2.8 for the relationships between the means and
covariances of the prior, noise and posterior). On the other hand, if either the
observation operator is non-linear or the observational noise is non-Gaussian,
then a Gaussian prior is generally transformed into a non-Gaussian posterior.
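A one-dimensional toy problem makes the last point concrete. The sketch below assumes a standard Gaussian prior on $\mathbb{R}$, the non-linear observation operator $H(u) = u^2$, Gaussian noise of standard deviation $\sigma = 0.3$ and the datum $y = 1$; all of these are illustrative choices. The posterior density is evaluated on a grid from $\exp(-\Phi(u; y))$ times the prior density, and turns out to be bimodal, hence non-Gaussian.

```python
import numpy as np

# Scalar toy problem (all choices illustrative): prior u ~ N(0, 1),
# observation y = H(u) + eta with H(u) = u^2 and eta ~ N(0, sigma^2).
sigma, y = 0.3, 1.0
u = np.linspace(-4.0, 4.0, 4001)
du = u[1] - u[0]

prior = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
potential = 0.5 * ((y - u**2) / sigma) ** 2      # Phi(u; y), up to a function of y
unnormalized = np.exp(-potential) * prior        # exp(-Phi) times the prior density
Z = np.sum(unnormalized) * du                    # normalizing constant (Riemann sum)
posterior = unnormalized / Z

# Interior local maxima of the posterior density on the grid
is_peak = (posterior[1:-1] > posterior[:-2]) & (posterior[1:-1] > posterior[2:])
modes = u[1:-1][is_peak]
print("posterior modes near:", modes)
print("Z(y) approximately:  ", Z)
```

Setting the derivative of $\Phi(u; y) + u^2/2$ to zero shows that the two modes sit at $u = \pm \sqrt{y - \sigma^2/2}$, slightly inside $\pm \sqrt{y}$ because of the pull of the prior towards the origin; a MAP estimator would report only one of them.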


6.3 Well-Posedness and Approximation 

This section concerns the well-posedness of the Bayesian inference problem for
Gaussian priors on Banach spaces. To save space later on, the following will be
taken as our standard assumptions on the negative log-likelihood/potential $\Phi$.
In essence, we wish to restrict attention to potentials $\Phi$ that are Lipschitz in
both arguments, bounded on bounded sets, and that do not decay to $-\infty$ at
infinity ‘too quickly’.

Assumptions on $\Phi$. Assume that $\Phi \colon \mathcal{U} \times \mathcal{Y} \to \mathbb{R}$ satisfies:

(A1) For every $\varepsilon > 0$ and $r > 0$, there exists $M = M(\varepsilon, r) \in \mathbb{R}$ such that, for
all $u \in \mathcal{U}$ and all $y \in \mathcal{Y}$ with $\| y \|_{\mathcal{Y}} < r$,

\[ \Phi(u; y) \geq M - \varepsilon \| u \|_{\mathcal{U}}^2 . \]

(A2) For every $r > 0$, there exists $K = K(r) > 0$ such that, for all $u \in \mathcal{U}$
and all $y \in \mathcal{Y}$ with $\| u \|_{\mathcal{U}}, \| y \|_{\mathcal{Y}} < r$,

\[ \Phi(u; y) \leq K. \]

(A3) For every $r > 0$, there exists $L = L(r) > 0$ such that, for all $u_1, u_2 \in \mathcal{U}$
and all $y \in \mathcal{Y}$ with $\| u_1 \|_{\mathcal{U}}, \| u_2 \|_{\mathcal{U}}, \| y \|_{\mathcal{Y}} < r$,

\[ | \Phi(u_1; y) - \Phi(u_2; y) | \leq L \| u_1 - u_2 \|_{\mathcal{U}} . \]

(A4) For every $\varepsilon > 0$ and $r > 0$, there exists $C = C(\varepsilon, r) > 0$ such that, for
all $u \in \mathcal{U}$ and all $y_1, y_2 \in \mathcal{Y}$ with $\| y_1 \|_{\mathcal{Y}}, \| y_2 \|_{\mathcal{Y}} < r$,

\[ | \Phi(u; y_1) - \Phi(u; y_2) | \leq \exp\left( \varepsilon \| u \|_{\mathcal{U}}^2 + C \right) \| y_1 - y_2 \|_{\mathcal{Y}} . \]

We first show that, for Gaussian priors, these assumptions yield a
well-defined posterior measure for each possible instance of the observed data:

Theorem 6.8. Let $\Phi$ satisfy standard assumptions (A1), (A2), and (A3)
and assume that $\mu_0$ is a Gaussian probability measure on $\mathcal{U}$. Then, for each
$y \in \mathcal{Y}$, $\mu^y$ given by

\begin{align*}
\frac{\mathrm{d}\mu^y}{\mathrm{d}\mu_0}(u) &= \frac{\exp(-\Phi(u; y))}{Z(y)}, \\
Z(y) &= \int_{\mathcal{U}} \exp(-\Phi(u; y)) \, \mathrm{d}\mu_0(u),
\end{align*}

is a well-defined probability measure on $\mathcal{U}$.

Proof. Assumption (A2) implies that $Z(y)$ is bounded below:

\[ Z(y) \geq \int_{\{ u \mid \| u \|_{\mathcal{U}} < r \}} \exp(-K(r)) \, \mathrm{d}\mu_0(u) = \exp(-K(r)) \, \mu_0[ \| u \|_{\mathcal{U}} < r ] > 0 \]

for $r > 0$, since $\mu_0$ is a strictly positive measure on $\mathcal{U}$. By (A3), $\Phi$ is
$\mu_0$-measurable, and so $\mu^y$ is a well-defined measure. By (A1), for $\| y \|_{\mathcal{Y}} < r$
and $\varepsilon$ sufficiently small,

\begin{align*}
Z(y) &= \int_{\mathcal{U}} \exp(-\Phi(u; y)) \, \mathrm{d}\mu_0(u) \\
&\leq \int_{\mathcal{U}} \exp\left( \varepsilon \| u \|_{\mathcal{U}}^2 - M(\varepsilon, r) \right) \mathrm{d}\mu_0(u) \\
&\leq C \exp(-M(\varepsilon, r)) < \infty,
\end{align*}

since $\mu_0$ is Gaussian and we may choose $\varepsilon$ small enough that the Fernique
theorem (Theorem 2.47) applies. Thus, $\mu^y$ can indeed be normalized to be a
probability measure on $\mathcal{U}$. □

Recall from Chapter 5 that the Hellinger distance between two probability
measures $\mu$ and $\nu$ on $\mathcal{U}$ is defined in terms of any reference measure $\rho$ with
respect to which both $\mu$ and $\nu$ are absolutely continuous by

\[ d_{\mathrm{H}}(\mu, \nu) := \left( \int_{\mathcal{U}} \left| \sqrt{\frac{\mathrm{d}\mu}{\mathrm{d}\rho}(u)} - \sqrt{\frac{\mathrm{d}\nu}{\mathrm{d}\rho}(u)} \right|^2 \mathrm{d}\rho(u) \right)^{1/2} . \]

A particularly useful property of the Hellinger metric is that closeness in the
Hellinger metric implies closeness of expected values of polynomially bounded
functions: if $f \colon \mathcal{U} \to \mathcal{V}$, for some Banach space $\mathcal{V}$, then Proposition 5.12 gives
that

\[ \left\| \mathbb{E}_\mu[f] - \mathbb{E}_\nu[f] \right\| \leq 2 \sqrt{ \mathbb{E}_\mu[ \| f \|^2 ] + \mathbb{E}_\nu[ \| f \|^2 ] } \; d_{\mathrm{H}}(\mu, \nu). \]

Therefore, Hellinger-close prior and posterior measures give similar expected
values to quantities of interest; indeed, for fixed $f$, the perturbation in the
expected value is Lipschitz with respect to the Hellinger size of the
perturbation in the measure.
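For measures on a finite set the Hellinger distance and the above bound can be evaluated exactly. The sketch below takes counting measure as the reference measure and uses the normalization without a factor of $\frac{1}{2}$ (an assumption about the convention; some authors include that factor, which rescales the distance by $1/\sqrt{2}$, and the inequality holds under either convention). The two probability vectors and the quantity of interest $f$ are random and purely illustrative.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two probability vectors, with counting
    measure as reference measure and no factor of 1/2 (assumed convention)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

rng = np.random.default_rng(0)
p = rng.random(10)
p /= p.sum()
q = rng.random(10)
q /= q.sum()
f = rng.standard_normal(10)              # a real-valued quantity of interest

lhs = abs(np.dot(p, f) - np.dot(q, f))   # |E_p[f] - E_q[f]|
rhs = 2.0 * np.sqrt(np.dot(p, f**2) + np.dot(q, f**2)) * hellinger(p, q)
print("difference of expectations:", lhs)
print("Hellinger bound:           ", rhs)
```

The bound follows from writing the difference of densities as a product of the sum and the difference of their square roots and applying the Cauchy-Schwarz inequality, so only second moments of $f$ enter.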

The following theorem shows that Bayesian inference with respect to a
Gaussian prior measure is well-posed with respect to perturbations of the
observed data $y$, in the sense that the Hellinger distance between the
corresponding posteriors is Lipschitz in the size of the perturbation in the data:

Theorem 6.9. Let $\Phi$ satisfy the standard assumptions (A1), (A2), and (A4),
suppose that $\mu_0$ is a Gaussian probability measure on $\mathcal{U}$, and that $\mu^y \ll \mu_0$
with density given by the generalized Bayes’ rule for each $y \in \mathcal{Y}$. Then there
exists a constant $C \geq 0$ such that, for all $y, y' \in \mathcal{Y}$,

\[ d_{\mathrm{H}}(\mu^y, \mu^{y'}) \leq C \| y - y' \|_{\mathcal{Y}} . \]


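Proof. As in the proof of Theorem 6.8, (A2) gives a lower bound on $Z(y)$.
We also have the following Lipschitz continuity estimate for the difference
between the normalizing constants for $y$ and $y'$:

\begin{align*}
| Z(y) - Z(y') | &\leq \int_{\mathcal{U}} \left| e^{-\Phi(u; y)} - e^{-\Phi(u; y')} \right| \mathrm{d}\mu_0(u) \\
&\leq \int_{\mathcal{U}} \max\left\{ e^{-\Phi(u; y)}, e^{-\Phi(u; y')} \right\} | \Phi(u; y) - \Phi(u; y') | \, \mathrm{d}\mu_0(u)
\end{align*}

by the mean value theorem (MVT). Hence,

\begin{align*}
| Z(y) - Z(y') | &\leq \int_{\mathcal{U}} e^{\varepsilon \| u \|_{\mathcal{U}}^2 - M} \cdot e^{\varepsilon \| u \|_{\mathcal{U}}^2 + C} \| y - y' \|_{\mathcal{Y}} \, \mathrm{d}\mu_0(u) && \text{by (A1) and (A4)} \\
&\leq C \| y - y' \|_{\mathcal{Y}} && \text{by Fernique.}
\end{align*}

By the definition of the Hellinger distance, using the prior $\mu_0$ as the reference
measure,

\begin{align*}
d_{\mathrm{H}}(\mu^y, \mu^{y'})^2 &= \int_{\mathcal{U}} \left| \frac{1}{\sqrt{Z(y)}} e^{-\Phi(u; y)/2} - \frac{1}{\sqrt{Z(y')}} e^{-\Phi(u; y')/2} \right|^2 \mathrm{d}\mu_0(u) \\
&= \frac{1}{Z(y)} \int_{\mathcal{U}} \left| e^{-\Phi(u; y)/2} - \sqrt{\frac{Z(y)}{Z(y')}} \, e^{-\Phi(u; y')/2} \right|^2 \mathrm{d}\mu_0(u) \\
&\leq I_1 + I_2,
\end{align*}

where

\begin{align*}
I_1 &:= \frac{2}{Z(y)} \int_{\mathcal{U}} \left| e^{-\Phi(u; y)/2} - e^{-\Phi(u; y')/2} \right|^2 \mathrm{d}\mu_0(u), \\
I_2 &:= 2 \left| \frac{1}{\sqrt{Z(y)}} - \frac{1}{\sqrt{Z(y')}} \right|^2 \int_{\mathcal{U}} e^{-\Phi(u; y')} \, \mathrm{d}\mu_0(u).
\end{align*}

For $I_1$, a similar application of the MVT, (A1) and (A4), and the Fernique
theorem to the one above yields that

\begin{align*}
I_1 &\leq \frac{2}{Z(y)} \int_{\mathcal{U}} \max\left\{ \tfrac{1}{2} e^{-\Phi(u; y)/2}, \tfrac{1}{2} e^{-\Phi(u; y')/2} \right\}^2 \cdot | \Phi(u; y) - \Phi(u; y') |^2 \, \mathrm{d}\mu_0(u) \\
&\leq \frac{1}{2 Z(y)} \int_{\mathcal{U}} e^{\varepsilon \| u \|_{\mathcal{U}}^2 - M} \cdot e^{2 \varepsilon \| u \|_{\mathcal{U}}^2 + 2C} \| y - y' \|_{\mathcal{Y}}^2 \, \mathrm{d}\mu_0(u) \\
&\leq C \| y - y' \|_{\mathcal{Y}}^2 .
\end{align*}

A similar application of (A1) and the Fernique theorem shows that the
integral in $I_2$ is finite. Also, the lower bound on $Z( \cdot )$ implies that

\begin{align*}
\left| \frac{1}{\sqrt{Z(y)}} - \frac{1}{\sqrt{Z(y')}} \right|^2 &\leq C \max\left\{ \frac{1}{Z(y)^3}, \frac{1}{Z(y')^3} \right\} | Z(y) - Z(y') |^2 \\
&\leq C \| y - y' \|_{\mathcal{Y}}^2 .
\end{align*}

Thus, $I_2 \leq C \| y - y' \|_{\mathcal{Y}}^2$, which completes the proof. □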
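The Lipschitz dependence asserted by Theorem 6.9 can be observed in a scalar example. The sketch below assumes the prior $\mathcal{N}(0, 1)$ on $\mathbb{R}$ and the potential $\Phi(u; y) = |y - u|^2 / (2\sigma^2)$ with $\sigma = 0.5$ (illustrative choices), evaluates posterior densities with respect to Lebesgue measure on a grid, and prints the ratio $d_{\mathrm{H}}(\mu^y, \mu^{y'}) / |y - y'|$ for a few perturbation sizes. The Hellinger distance does not depend on the reference measure, so using Lebesgue measure instead of the prior is harmless.

```python
import numpy as np

def posterior_on_grid(y, u, sigma=0.5):
    """Lebesgue density of mu^y on a uniform grid, for the prior N(0, 1)
    and potential Phi(u; y) = |y - u|^2 / (2 sigma^2)."""
    du = u[1] - u[0]
    dens = np.exp(-0.5 * ((y - u) / sigma) ** 2 - 0.5 * u**2)
    return dens / (np.sum(dens) * du)

def hellinger_on_grid(p, q, du):
    """Hellinger distance (no factor 1/2) between two gridded densities."""
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * du))

u = np.linspace(-8.0, 8.0, 16001)
du = u[1] - u[0]
y = 1.0
ratios = []
for h in [0.4, 0.2, 0.1, 0.05]:
    d = hellinger_on_grid(posterior_on_grid(y, u), posterior_on_grid(y + h, u), du)
    ratios.append(d / h)
    print(f"|y - y'| = {h:5.2f}   d_H = {d:.5f}   d_H / |y - y'| = {d / h:.4f}")
```

In this linear Gaussian case the two posteriors are Gaussians with a common variance and shifted means, so the ratio stays bounded as the perturbation shrinks, in line with the theorem.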

Similarly, the next theorem shows that Bayesian inference with respect to
a Gaussian prior measure is well-posed with respect to approximation of
measures and log-likelihoods. The approximation of $\Phi$ by some $\Phi^N$ typically arises
through the approximation of $H$ by some discretized numerical model $H^N$.
The importance of Theorem 6.10 is that it allows error estimates for the
forward models $H$ and $H^N$, which typically arise through non-probabilistic
numerical analysis, to be translated into error estimates for the Bayesian
inverse problem.

Theorem 6.10. Suppose that the probability measures $\mu$ and $\mu^N$ are the
posteriors arising from potentials $\Phi$ and $\Phi^N$ and are all absolutely continuous
with respect to $\mu_0$, and that $\Phi$, $\Phi^N$ satisfy the standard assumptions (A1) and
(A2) with constants uniform in $N$. Assume also that, for all $\varepsilon > 0$, there exists
$K = K(\varepsilon) > 0$ such that

\[ | \Phi(u; y) - \Phi^N(u; y) | \leq K \exp\left( \varepsilon \| u \|_{\mathcal{U}}^2 \right) \psi(N), \tag{6.11} \]

where $\lim_{N \to \infty} \psi(N) = 0$. Then there is a constant $C$, independent of $N$,
such that

\[ d_{\mathrm{H}}(\mu, \mu^N) \leq C \psi(N). \]

Proof. Exercise 6.4. □

Remark 6.11. Note well that, regardless of the value of the observed data
$y$, the Bayesian posterior $\mu^y$ is absolutely continuous with respect to the
prior $\mu_0$ and, in particular, cannot associate positive posterior probability
with any event of prior probability zero. However, the Feldman-Hájek
theorem (Theorem 2.51) says that it is very difficult for probability measures
on infinite-dimensional spaces to be absolutely continuous with respect to
one another. Therefore, the choice of infinite-dimensional prior $\mu_0$ is a very
strong modelling assumption that, if it is ‘wrong’, cannot be ‘corrected’ even
by large amounts of data $y$. In this sense, it is not reasonable to expect that
Bayesian inference on function spaces should be well-posed with respect to
apparently small perturbations of the prior $\mu_0$, e.g. by a shift of mean that
lies outside the Cameron-Martin space, or a change of covariance arising from
a non-unit dilation of the space. Nevertheless, the infinite-dimensional
perspective is not without genuine fruits: in particular, the well-posedness results
(Theorems 6.9 and 6.10) are very important for the design of finite-dimensional
(discretized) Bayesian problems that have good stability properties with
respect to discretization dimension $N$.


6.4 Accessing the Bayesian Posterior Measure 

For given data y e A, the Bayesian posterior /xo( • \y) on U is determined as a 
measure that has a density with respect to the prior /xo given by Bayes’ rule, 
e.g. in the form (6.10), 
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— (u) oc exp(— 
d/io 

The results outlined above have shown some of the analytical properties 
of this construction. However, in practice, this well-posedness theory is not 
the end of the story, principally because we need to be able to access this 
posterior measure: in particular, it is necessary to be able to (numerically) 
integrate with respect to the posterior, in order to form the posterior expected 
value of quantities of interest. (Note, for example, that (6.10) gives a non- 
normalized density for the posterior with respect to the prior, and this lack 
of normalization is sometimes an additional practical obstacle.) 

The general problem of how to access the Bayesian posterior measure is 
a complicated and interesting one. Roughly speaking, there are three classes 
of methods for exploration of the posterior, some of which will be discussed 
in depth at appropriate points later in the book: 

(a) Methods such as Markov chain Monte Carlo, to be discussed in Chapter 
9, attempt to sample from the posterior directly, using the formula for 
its density with respect to the prior. 

In principle, one could also integrate with respect to the posterior by 
drawing samples from some other measure (e.g. the prior, or some other 
reference measure) and then re-weighting according to the appropriate 
probability density. However, some realizations of the data may cause the 
density d/io(* \y)/dfjLo to be significantly different from 1 for most draws 
from the prior, leading to severe ill-conditioning. For this reason, ‘direct’ 
draws from the posterior are highly preferable. 

An alternative to re-weighting of prior samples is to transform prior sam- 
ples into posterior samples while preserving their probability weights. 
That is, one seeks a function $T^y \colon \mathcal{U} \to \mathcal{U}$ from the parameter space $\mathcal{U}$
to itself that pushes forward any prior to its corresponding posterior,
i.e. $T^y_* \mu_0 = \mu_0(\cdot \mid y)$, and hence turns an ensemble $\{u^{(1)}, \dots, u^{(N)}\}$ of
independent samples distributed according to the prior into an ensemble
$\{T^y(u^{(1)}), \dots, T^y(u^{(N)})\}$ of independent samples distributed according
to the posterior. Map-based approaches to Bayesian inference include 
the approach of El Moselhy and Marzouk (2012), grounded in optimal 
transportation theory, and will not be discussed further here. 

(b) A second class of methods attempts to approximate the posterior, often 
through approximating the forward and observation models, and hence 
the likelihood. Many of the modelling methods discussed in Chapters 
10-13 are examples of such approaches. For example, the Gauss-Markov 
theorem (Theorem 6.2) and Linear Kalman Filter (see Section 7.2) pro- 
vide optimal approximations of the posterior within the class of Gaussian 
measures, with linear forward and observation operators. 

(c) Finally, as a catch-all term, there are the ‘ad hoc’ methods. In this cat- 
egory, we include the Ensemble Kalman Filter of Evensen (2009), which 
will be discussed in Section 7.4. 
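
The ill-conditioning of prior re-weighting mentioned under (a) can be seen in a small numerical experiment. The following is a minimal sketch under assumptions that are not taken from the text: a scalar parameter with prior $\mathcal{N}(0, 1)$, a single observation $y = u + \eta$ with Gaussian noise, and the illustrative function name `prior_reweighting_ess`. It reports the effective sample size of the self-normalized weights, which collapses when the data are informative and far from the bulk of the prior.

```python
import numpy as np

def prior_reweighting_ess(y, noise_std, n_samples=10_000, seed=0):
    """Re-weight prior draws by the likelihood; return (posterior mean estimate, ESS).

    Assumed toy model (hypothetical): prior u ~ N(0, 1), data y = u + N(0, noise_std^2).
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n_samples)           # draws from the prior
    log_w = -0.5 * ((y - u) / noise_std) ** 2    # log of the unnormalized density exp(-Phi(u; y))
    log_w -= log_w.max()                         # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()                                 # self-normalization replaces the unknown constant
    posterior_mean = float(np.sum(w * u))
    ess = float(1.0 / np.sum(w ** 2))            # effective sample size of the weighted ensemble
    return posterior_mean, ess

# Mild case: posterior close to the prior, so most prior draws carry useful weight.
mean_mild, ess_mild = prior_reweighting_ess(y=0.5, noise_std=1.0)
# Hard case: accurate data far in the prior tail, so a handful of draws dominate.
mean_hard, ess_hard = prior_reweighting_ess(y=4.0, noise_std=0.1)
print(ess_mild, ess_hard)
```

For this conjugate toy model the exact posterior mean is $y / (1 + \sigma^2)$, so the quality of the weighted estimate can be checked directly; in the hard case the estimate rests on very few samples.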

6.5 Frequentist Consistency of Bayesian Methods

A surprisingly subtle question about Bayesian inference is whether it yields 
the ‘correct’ result, regardless of the prior used, when exposed to enough 
sample data. Clearly, when very few data points have been observed, the 
prior controls the posterior much more strongly than the observed data do, 
so it is necessary to answer such questions in an asymptotic limit. It is also 
necessary to clarify what is meant by ‘correctness’. One such notion is that 
of frequentist consistency : 

“While for a Bayesian statistician the analysis ends in a certain sense with the
posterior, one can ask interesting questions about the properties of posterior-based
inference from a frequentist point of view.” (Nickl, 2013)

To describe frequentist consistency, consider the standard setup of a
Bayesian prior $\mu_0$ on some space $\mathcal{U}$, together with a Bayesian likelihood model
for observed data with values in another space $\mathcal{Y}$, i.e. a family of probability
measures $\mu(\cdot \mid u) \in \mathcal{M}_1(\mathcal{Y})$ indexed by $u \in \mathcal{U}$. Now introduce a new ingredient,
which is a probability measure $\mu^\dagger \in \mathcal{M}_1(\mathcal{Y})$ that is treated as the ‘truth’
in the sense that the observed data are in fact a sequence of independent and
identically distributed draws from $\mu^\dagger$.

Definition 6.12. The likelihood model $\{\mu(\cdot \mid u) \mid u \in \mathcal{U}\}$ is said to be
well-specified if there exists some $u^\dagger \in \mathcal{U}$ such that $\mu^\dagger = \mu(\cdot \mid u^\dagger)$, i.e. if there
is some member of the model family that exactly coincides with the data-
generating distribution. If the model is not well-specified, then it is said to
be misspecified.

In the well-specified case, the model and the parameter space $\mathcal{U}$ admit
some $u^\dagger$ that explains the frequentist ‘truth’ $\mu^\dagger$. The natural question to
ask is whether exposure to enough independent draws $Y_1, \dots, Y_n$ from $\mu^\dagger$
will permit the model to identify $u^\dagger$ out of all the other possible $u \in \mathcal{U}$. If
some sequence of estimators or other objects (such as Bayesian posteriors)
converges as $n \to \infty$ to $u^\dagger$ with respect to some notion of convergence,
then the estimator is said to be consistent. For example, Theorem 6.13 gives
conditions for the maximum likelihood estimator (MLE) to be consistent,
with the mode of convergence being convergence in probability; Theorem
6.17 (the Bernstein-von Mises theorem) gives conditions for the Bayesian
posterior to be consistent, with the mode of convergence being convergence
in probability, and with respect to the total variation distance on probability
measures.

In order to state some concrete results on consistency, suppose now that
$\mathcal{U} \subseteq \mathbb{R}^p$ and $\mathcal{Y} \subseteq \mathbb{R}^d$, and that the likelihood model $\{\mu(\cdot \mid u) \mid u \in \mathcal{U}\}$ can be
written in the form of a parametric family of probability density functions
with respect to Lebesgue measure on $\mathbb{R}^d$, which will be denoted by a function
$f(\cdot \mid \cdot) \colon \mathcal{Y} \times \mathcal{U} \to [0, \infty)$, i.e.

$$\mu(E \mid u) = \int_E f(y \mid u) \, dy$$

for each measurable $E \subseteq \mathcal{Y}$ and each $u \in \mathcal{U}$.


Before giving results about the convergence of the Bayesian posterior, we
first state a result about the convergence of the maximum likelihood estimator
(MLE) $\hat{u}_n$ for $u^\dagger$ given the data $Y_1, \dots, Y_n \sim \mu^\dagger$, which, as the name suggests,
is defined by

$$\hat{u}_n \in \operatorname*{arg\,max}_{u \in \mathcal{U}} f(Y_1 \mid u) \cdots f(Y_n \mid u).$$

Note that, being a function of the random variables $Y_1, \dots, Y_n$, $\hat{u}_n$ is itself a
random variable.

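Theorem 6.13 (Consistency of the MLE). Suppose that $f(y \mid u) > 0$ for all
$(u, y) \in \mathcal{U} \times \mathcal{Y}$, that $\mathcal{U}$ is compact, and that parameters $u \in \mathcal{U}$ are identifiable
in the sense that

$$f(\cdot \mid u_0) = f(\cdot \mid u_1) \text{ Lebesgue a.e.} \iff u_0 = u_1,$$

and that

$$\int_{\mathcal{Y}} \sup_{u \in \mathcal{U}} |\log f(y \mid u)| \, f(y \mid u^\dagger) \, dy < \infty.$$

Then the maximum likelihood estimator $\hat{u}_n$ converges to $u^\dagger$ in probability,
i.e. for all $\varepsilon > 0$,

$$\mathbb{P}_{Y_i \sim \mu^\dagger} \bigl[ |\hat{u}_n - u^\dagger| > \varepsilon \bigr] \xrightarrow[n \to \infty]{} 0. \tag{6.12}$$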


The proof of Theorem 6.13 is omitted, and can be found in Nickl (2013). 
The next two results quantify the convergence of the MLE and Bayesian 
posterior in terms of the following matrix: 

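Definition 6.14. The Fisher information matrix $i_F(u^\dagger) \in \mathbb{R}^{p \times p}$ of $f$ at
$u^\dagger \in \mathcal{U}$ is defined by

$$i_F(u^\dagger)_{ij} := \mathbb{E}_{Y \sim f(\cdot \mid u^\dagger)} \left[ \left. \frac{\partial \log f(Y \mid u)}{\partial u_i} \frac{\partial \log f(Y \mid u)}{\partial u_j} \right|_{u = u^\dagger} \right]. \tag{6.13}$$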


Remark 6.15. Under the regularity conditions that will be used later,
$i_F(u^\dagger)$ is a symmetric and positive-definite matrix, and so can be viewed
as a Riemannian metric tensor on $\mathcal{U}$, varying from one point $u^\dagger \in \mathcal{U}$ to
another. In that context, it is known as the Fisher-Rao metric tensor, and
plays an important role in the field of information geometry in general, and
geodesic Monte Carlo methods in particular.

The next two results, the lengthy proofs of which are also omitted, are
both asymptotic normality results. The first shows that the error in the
MLE is asymptotically a normal distribution with covariance operator given
by the inverse of the Fisher information; informally, for large $n$, $\hat{u}_n$ is normally
distributed with mean $u^\dagger$ and precision $n \, i_F(u^\dagger)$. The second result, the celebrated
Bernstein-von Mises theorem or Bayesian CLT (central limit theorem),
shows that the entire Bayesian posterior distribution is asymptotically a normal
distribution centred on the MLE, which, under the conditions of Theorem
6.13, converges to the frequentist ‘truth’. These results hold under suitable
regularity conditions on the likelihood model, which are summarized here for
later reference:

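Regularity Assumptions. The parametric family $f \colon \mathcal{Y} \times \mathcal{U} \to [0, \infty)$ will
be said to satisfy the regularity assumptions with respect to a data-generating
distribution $\mu^\dagger \in \mathcal{M}_1(\mathcal{Y})$ if

(a) for all $u \in \mathcal{U}$ and $y \in \mathcal{Y}$, $f(y \mid u) > 0$;

(b) the model is well-specified, with $\mu^\dagger = \mu(\cdot \mid u^\dagger)$, where $u^\dagger$ is an interior
point of $\mathcal{U}$;

(c) there exists an open set $U$ with $u^\dagger \in U \subseteq \mathcal{U}$ such that, for each $y \in \mathcal{Y}$,
$f(y \mid \cdot) \in C^2(U; \mathbb{R})$;

(d) $\mathbb{E}_{Y \sim \mu^\dagger} \bigl[ \nabla_u^2 \log f(Y \mid u) \big|_{u = u^\dagger} \bigr] \in \mathbb{R}^{p \times p}$ is non-singular and

$$\mathbb{E}_{Y \sim \mu^\dagger} \Bigl[ \bigl\| \nabla_u \log f(Y \mid u) \big|_{u = u^\dagger} \bigr\|^2 \Bigr] < \infty;$$

(e) there exists $r > 0$ such that $B = B_r(u^\dagger) \subseteq U$ and

$$\mathbb{E}_{Y \sim \mu^\dagger} \Bigl[ \sup_{u \in B} \bigl\| \nabla_u^2 \log f(Y \mid u) \bigr\| \Bigr] < \infty,$$

$$\int_{\mathcal{Y}} \sup_{u \in B} \bigl\| \nabla_u \log f(y \mid u) \bigr\| \, dy < \infty,$$

$$\int_{\mathcal{Y}} \sup_{u \in B} \bigl\| \nabla_u^2 \log f(y \mid u) \bigr\| \, dy < \infty.$$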


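Theorem 6.16 (Local asymptotic normality of the MLE). Suppose that
$f$ satisfies the regularity assumptions. Then the Fisher information matrix
(6.13) satisfies

$$i_F(u^\dagger)_{ij} = - \mathbb{E}_{Y \sim f(\cdot \mid u^\dagger)} \left[ \left. \frac{\partial^2 \log f(Y \mid u)}{\partial u_i \, \partial u_j} \right|_{u = u^\dagger} \right]$$

and the maximum likelihood estimator satisfies

$$\sqrt{n} \, (\hat{u}_n - u^\dagger) \xrightarrow[n \to \infty]{d} X \sim \mathcal{N}\bigl(0, i_F(u^\dagger)^{-1}\bigr), \tag{6.14}$$

where $\xrightarrow{d}$ denotes convergence in distribution (also known as weak convergence,
q.v. Theorem 5.14), i.e. $X_n \xrightarrow{d} X$ if $\mathbb{E}[\varphi(X_n)] \to \mathbb{E}[\varphi(X)]$ for all
bounded continuous functions $\varphi \colon \mathbb{R}^p \to \mathbb{R}$.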
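
The limit (6.14) can be checked by simulation in a model where the MLE is available in closed form. The sketch below is an illustration under assumptions that are not part of the text: the exponential family $f(y \mid u) = u e^{-u y}$ on $y > 0$, for which $\hat{u}_n = 1 / \bar{Y}_n$ and $i_F(u) = u^{-2}$, and the hypothetical helper name `scaled_mle_errors`. The empirical variance of $\sqrt{n}(\hat{u}_n - u^\dagger)$ should then be close to $i_F(u^\dagger)^{-1} = (u^\dagger)^2$.

```python
import numpy as np

def scaled_mle_errors(u_true=2.0, n=1_000, n_repeats=2_000, seed=1):
    """Return sqrt(n) * (u_hat_n - u_true) over independent repetitions.

    Assumed model (hypothetical example): Y_i ~ Exponential(rate=u_true), so the
    MLE is the reciprocal of the sample mean.
    """
    rng = np.random.default_rng(seed)
    samples = rng.exponential(scale=1.0 / u_true, size=(n_repeats, n))
    u_hat = 1.0 / samples.mean(axis=1)      # one MLE per repetition
    return np.sqrt(n) * (u_hat - u_true)

u_true = 2.0
errors = scaled_mle_errors(u_true=u_true)
inverse_fisher = u_true ** 2                # i_F(u)^{-1} = u^2 for this model
print(errors.mean(), errors.var(), inverse_fisher)
```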

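Theorem 6.17 (Bernstein-von Mises). Suppose that $f$ satisfies the regularity
assumptions. Suppose that the prior $\mu_0 \in \mathcal{M}_1(\mathcal{U})$ is absolutely continuous
with respect to Lebesgue measure and has $u^\dagger \in \operatorname{supp}(\mu_0)$. Suppose also that
the model admits a uniformly consistent estimator, i.e. a $T_n \colon \mathcal{Y}^n \to \mathbb{R}^p$ such
that, for all $\varepsilon > 0$,

$$\sup_{u \in \mathcal{U}} \mathbb{P}_{Y_i \sim f(\cdot \mid u)} \bigl[ \| T_n(Y_1, \dots, Y_n) - u \| > \varepsilon \bigr] \xrightarrow[n \to \infty]{} 0. \tag{6.15}$$

Let $\mu_n := \mu_0(\cdot \mid Y_1, \dots, Y_n)$ denote the (random) posterior measure obtained
by conditioning $\mu_0$ on $n$ independent $\mu^\dagger$-distributed samples $Y_i$. Then, for all
$\varepsilon > 0$,

$$\mathbb{P}_{Y_i \sim \mu^\dagger} \left[ \left\| \mu_n - \mathcal{N}\!\left( \hat{u}_n, \frac{i_F(u^\dagger)^{-1}}{n} \right) \right\|_{\mathrm{TV}} > \varepsilon \right] \xrightarrow[n \to \infty]{} 0. \tag{6.16}$$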
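
To make (6.16) concrete, one can compute the total variation distance between an exactly known posterior and its normal approximation. The following is a rough numerical sketch under assumptions chosen for convenience and not taken from the text: the exponential model $f(y \mid u) = u e^{-u y}$ with a $\mathrm{Gamma}(1, 1)$ prior, so that the posterior is $\mathrm{Gamma}(1 + n, 1 + \sum_i Y_i)$, with $\hat{u}_n = n / \sum_i Y_i$ and $i_F(u^\dagger) = (u^\dagger)^{-2}$. The function name `tv_posterior_vs_normal` is illustrative, and the integral is approximated by a Riemann sum on a grid.

```python
import math
import numpy as np

def tv_posterior_vs_normal(n, u_true=2.0, seed=0, grid_size=20_001):
    """Approximate || mu_n - N(u_hat_n, i_F(u_true)^{-1} / n) ||_TV on a grid.

    Assumed setting (hypothetical): Y_i ~ Exponential(rate=u_true), prior Gamma(1, 1).
    """
    rng = np.random.default_rng(seed)
    s = float(rng.exponential(scale=1.0 / u_true, size=n).sum())
    u_mle = n / s
    sd = u_true / math.sqrt(n)                   # sqrt(i_F(u_true)^{-1} / n)
    u = np.linspace(max(u_mle - 10.0 * sd, 1e-9), u_mle + 10.0 * sd, grid_size)
    du = u[1] - u[0]
    shape, rate = 1.0 + n, 1.0 + s               # conjugate Gamma posterior parameters
    log_post = (shape * math.log(rate) + (shape - 1.0) * np.log(u)
                - rate * u - math.lgamma(shape))
    post = np.exp(log_post)
    normal = np.exp(-0.5 * ((u - u_mle) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
    return 0.5 * float(np.sum(np.abs(post - normal)) * du)

tv_small_n = tv_posterior_vs_normal(n=100)
tv_large_n = tv_posterior_vs_normal(n=100_000)
print(tv_small_n, tv_large_n)
```

The distance for the larger sample should be markedly smaller, in line with the theorem; a single realization of the data is of course only suggestive of convergence in probability.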


The Bernstein-von Mises theorem is often interpreted as saying that so
long as the prior $\mu_0$ is strictly positive (i.e. puts positive probability mass
on every open set in $\mathcal{U}$) the Bayesian posterior will asymptotically put all
its mass on the frequentist ‘truth’ $u^\dagger$ (assuming, of course, that $u^\dagger \in \mathcal{U}$).
Naturally, if $u^\dagger \notin \operatorname{supp}(\mu_0)$, then there is no hope of learning $u^\dagger$ in this
way, since the posterior is always absolutely continuous with respect to the
prior, and so cannot put mass where the prior does not. Therefore, it seems
sensible to use ‘open-minded’ priors that are everywhere strictly positive;
Lindley (1985) calls this requirement “Cromwell’s Rule” in reference to Oliver
Cromwell’s famous injunction to the Synod of the Church of Scotland in 1650:


“I beseech you, in the bowels of Christ, think it possible that you may be mistaken.” 


Unfortunately, the Bernstein-von Mises theorem is no longer true when 
the space U is infinite-dimensional, and Cromwell’s Rule is not a sufficient 
condition for consistency. In infinite-dimensional spaces, there are counterex- 
amples in which the posterior either fails to converge or converges to some- 
thing other than the ‘true’ parameter value — the latter being a particularly 
worrisome situation, since then a Bayesian practitioner will become more and 
more convinced of a wrong answer as more data come in. There are, however, 
some infinite-dimensional situations in which consistency properties do hold. 
In general, the presence or absence of consistency depends in subtle ways 
upon choices such as the topology of convergence of measures, and the types 
of sets for which one requires posterior consistency. See the bibliography at 
the end of the chapter for further details. 

(2015) and others. There are also positive results for infinite-dimensional
settings, such as those of Castillo and Nickl (2013, 2014) and Szabo et al. (2014,
2015). It is now becoming clear that the crossover from consistency to
inconsistency depends subtly upon the topology of convergence and the geometry
of the proposed credible/confidence sets.


6.7 Exercises 

Exercise 6.1. Let $\mu_1 = \mathcal{N}(m_1, C_1)$ and $\mu_2 = \mathcal{N}(m_2, C_2)$ be non-degenerate
Gaussian measures on $\mathbb{R}^n$ with Lebesgue densities $\rho_1$ and $\rho_2$ respectively.
Show that the probability measure with Lebesgue density proportional to
$\rho_1 \rho_2$ is a Gaussian measure $\mu_3 = \mathcal{N}(m_3, C_3)$, where

$$C_3^{-1} = C_1^{-1} + C_2^{-1},$$

$$m_3 = C_3 \bigl( C_1^{-1} m_1 + C_2^{-1} m_2 \bigr).$$

Note well the property that the precision matrices sum, whereas the covariance
matrices undergo a kind of harmonic average. (This result is sometimes
known as completing the square.)

Exercise 6.2. Let $\mu_0$ be a Gaussian probability measure on $\mathbb{R}^n$ and suppose
that the potential $\Phi(u; y)$ is quadratic in $u$. Show that the posterior
$d\mu^y \propto e^{-\Phi(u; y)} \, d\mu_0$ is also a Gaussian measure on $\mathbb{R}^n$. Using whatever
characterization of Gaussian measures you feel most comfortable with, extend this
result to a Gaussian probability measure $\mu_0$ on a separable Banach space $\mathcal{U}$.

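Exercise 6.3. Let $\Gamma \in \mathbb{R}^{q \times q}$ be symmetric and positive definite. Suppose
that $H \colon \mathcal{U} \to \mathbb{R}^q$ satisfies

(a) For every $\varepsilon > 0$, there exists $M \in \mathbb{R}$ such that, for all $u \in \mathcal{U}$,

$$\| H(u) \|_{\Gamma^{-1}} \leq \exp\bigl( \varepsilon \| u \|_{\mathcal{U}}^2 + M \bigr).$$

(b) For every $r > 0$, there exists $K > 0$ such that, for all $u_1, u_2 \in \mathcal{U}$ with
$\| u_1 \|_{\mathcal{U}}, \| u_2 \|_{\mathcal{U}} < r$,

$$\| H(u_1) - H(u_2) \|_{\Gamma^{-1}} \leq K \| u_1 - u_2 \|_{\mathcal{U}}.$$

Show that $\Phi \colon \mathcal{U} \times \mathbb{R}^q \to \mathbb{R}$ defined by

$$\Phi(u; y) := \tfrac{1}{2} \bigl\langle y - H(u), \Gamma^{-1} (y - H(u)) \bigr\rangle$$

satisfies the standard assumptions.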

Exercise 6.4. Prove Theorem 6.10. Hint: follow the model of Theorem 6.9,
with $(\mu, \mu^N)$ in place of $(\mu^y, \mu^{y'})$, and using (6.11) instead of (A4).
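
The identities of Exercise 6.1 are easy to sanity-check numerically before attempting a proof. The sketch below is an illustration, not a solution: it assumes NumPy, uses arbitrarily chosen two-dimensional means and covariances, and the helper names `gaussian_log_density` and `product_of_gaussians` are hypothetical. It verifies that $\log \rho_1 + \log \rho_2 - \log \rho_3$ is the same constant at every test point, which is what proportionality of densities means.

```python
import numpy as np

def gaussian_log_density(x, mean, cov):
    """Log Lebesgue density of N(mean, cov) at the point x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff)) - 0.5 * logdet

def product_of_gaussians(m1, C1, m2, C2):
    """Mean and covariance given by the formulas of Exercise 6.1."""
    P1, P2 = np.linalg.inv(C1), np.linalg.inv(C2)   # precision matrices
    C3 = np.linalg.inv(P1 + P2)                     # precisions add
    m3 = C3 @ (P1 @ m1 + P2 @ m2)                   # precision-weighted mean
    return m3, C3

# Arbitrary illustrative inputs (assumptions, not from the text).
m1, C1 = np.array([1.0, -2.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
m2, C2 = np.array([0.5, 0.0]), np.array([[1.0, -0.2], [-0.2, 0.5]])
m3, C3 = product_of_gaussians(m1, C1, m2, C2)

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))
gaps = np.array([gaussian_log_density(x, m1, C1) + gaussian_log_density(x, m2, C2)
                 - gaussian_log_density(x, m3, C3) for x in points])
print(np.ptp(gaps))   # spread of the log normalizing constant across test points
```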


Chapter 7 

Filtering and Data Assimilation 


It is not bigotry to be certain we are right; 
but it is bigotry to be unable to imagine how 
we might possibly have gone wrong. 


The Catholic Church and Conversion 
G. K. Chesterton 


Data assimilation is the integration of two information sources: 

• a mathematical model of a time-dependent physical system, or a numer- 
ical implementation of such a model; and 

• a sequence of observations of that system, usually corrupted by some 
noise. 

The objective is to combine these two ingredients to produce a more accurate 
estimate of the system’s true state, and hence more accurate predictions of the 
system’s future state. Very often, data assimilation is synonymous with fil- 
tering , which incorporates many of the same ideas but arose in the context of 
signal processing. An additional component of the data assimilation/filtering 
problem is that one typically wants to achieve it in real time: if today is 
Monday, then a data assimilation scheme that takes until Friday to produce 
an accurate prediction of Tuesday’s weather using Monday’s observations is 
basically useless. 

Data assimilation methods are typically Bayesian, in the sense that the 
current knowledge of the system state can be thought of as a prior, and the 
incorporation of the dynamics and observations as an update/conditioning 
step that produces a posterior. Bearing in mind considerations of computa- 
tional cost and the imperative for real time data assimilation, there are two 
key ideas underlying filtering: the first is to build up knowledge about the 
posterior sequentially, and hence perhaps more efficiently; the second is to 
break up the unknown state and build up knowledge about its constituent 
parts sequentially, hence reducing the computational dimension of each sampling
problem. Thus, the first idea means decomposing the data sequentially,
while the second means decomposing the unknown state sequentially.

A general mathematical formulation can be given in terms of stochastic
processes. Suppose that $T$ is an ordered index set, to be thought of as ‘time’;
typically either $T = \mathbb{N}_0$ or $T = [0, \infty) \subset \mathbb{R}$. It is desired to gain information
about a stochastic process $X \colon T \times \Theta \to \mathcal{X}$, defined over a probability space
$(\Theta, \mathscr{F}, \mu)$ and taking values in some space $\mathcal{X}$, from a second stochastic process
$Y \colon T \times \Theta \to \mathcal{Y}$. The first process, $X$, represents the state of the system,
which we do not know but wish to learn; the second process, $Y$, represents
the observations or data; typically, $Y$ is a lower-dimensional and/or corrupted
version of $X$.

Definition 7.1. Given stochastic processes $X$ and $Y$ as above, the filtering
problem is to construct a stochastic process $\hat{X} \colon T \times \Theta \to \mathcal{X}$ such that

• the estimate $\hat{X}_t$ is a ‘good’ approximation to the true state $X_t$, in a sense
to be made precise, for each $t \in T$; and

• $\hat{X}$ is a $\mathscr{F}^Y_\bullet$-adapted process, i.e. the estimate $\hat{X}_t$ of $X_t$ depends only upon
the observed data $Y_s$ for $s \leq t$, and not on as-yet-unobserved future data
$Y_s$ with $s > t$.

To make this problem tractable requires some a priori information about
the state process $X$, and how it relates to the observation process $Y$, as well as
a notion of optimality. This chapter makes these ideas more concrete with an
$L^2$ notion of optimality, and beginning with a discrete time filter with linear
dynamics for $X$ and a linear map from the state $X_t$ to the observations $Y_t$:
the Kalman filter.


7.1 State Estimation in Discrete Time 

In the Kalman filter, the probability distributions representing the system 
state and various noise terms are described purely in terms of their mean 
and covariance, so they are effectively being approximated as Gaussian dis- 
tributions. 

For simplicity, the first description of the Kalman filter will be of a
controlled linear dynamical system that evolves in discrete time steps

$$t_0 < t_1 < \dots < t_k < \cdots.$$

The state of the system at time $t_k$ is a vector $x_k$ in a Hilbert space $\mathcal{X}$, and it
evolves from the state $x_{k-1} \in \mathcal{X}$ at time $t_{k-1}$ according to the linear model

$$x_k = F_k x_{k-1} + G_k u_k + \xi_k, \tag{7.1}$$

where, for each time $t_k$,

• $F_k \colon \mathcal{X} \to \mathcal{X}$ is the state transition model, which is a linear map applied
to the previous state $x_{k-1} \in \mathcal{X}$;

• $G_k \colon \mathcal{U} \to \mathcal{X}$ is the control-to-input model, which is applied to the control
vector $u_k$ in a Hilbert space $\mathcal{U}$; and

• $\xi_k$ is the process noise, an $\mathcal{X}$-valued random variable with mean 0 and
(self-adjoint, positive-definite, trace class) covariance operator
$Q_k \colon \mathcal{X} \to \mathcal{X}$.

Naturally, the terms $F_k x_{k-1}$ and $G_k u_k$ can be combined into a single term,
but since many applications involve both uncontrolled dynamics and a control
$u_k$, which may in turn have been derived from estimates of $x_\ell$ for $\ell < k$, the
presentation here will keep the two terms separate.

At time $t_k$ an observation $y_k$ in a Hilbert space $\mathcal{Y}$ of the true state $x_k$ is
made according to

$$y_k = H_k x_k + \eta_k, \tag{7.2}$$

where

• $H_k \colon \mathcal{X} \to \mathcal{Y}$ is the linear observation operator; and

• $\eta_k \sim \mathcal{N}(0, R_k)$ is the observation noise, a $\mathcal{Y}$-valued random variable with
mean 0 and (self-adjoint, positive-definite, trace class) covariance operator
$R_k \colon \mathcal{Y} \to \mathcal{Y}$.

As an initial condition, the state of the system at time $t_0$ is taken to be
$x_0 = m_0 + \xi_0$, where $m_0 \in \mathcal{X}$ is known and $\xi_0$ is an $\mathcal{X}$-valued random
variable with (self-adjoint, positive-definite, trace class) covariance operator
$Q_0 \colon \mathcal{X} \to \mathcal{X}$. All the noise vectors are assumed to be mutually and pairwise
independent.

As a preliminary to constructing the actual Kalman filter, we consider
the problem of estimating states $x_1, \dots, x_k$ given the corresponding controls
$u_1, \dots, u_k$ and $\ell$ known observations $y_1, \dots, y_\ell$, where $k \geq \ell$. In particular,
we seek the best linear unbiased estimate of $x_1, \dots, x_k$.

Remark 7.2. If all the noise vectors are Gaussian, then since the forward
dynamics (7.1) are linear, Exercise 2.5 then implies that the joint distribution
of all the $x_k$ is Gaussian. Similarly, if the $\eta_k$ are Gaussian, then since
the observation relation (7.2) is linear, the $y_k$ are also Gaussian. Since, by
Theorem 2.54, the conditional distribution of a Gaussian measure is again
a Gaussian measure, we can achieve our objective of estimating $x_1, \dots, x_k$
given $y_1, \dots, y_\ell$ using a Gaussian description alone.

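In general, without making Gaussian assumptions, note that (7.1)-(7.2) is
equivalent to the single equation

$$b_{k|\ell} = A_{k|\ell} z_k + \zeta_{k|\ell}, \tag{7.3}$$

where, in block form,

$$b_{k|\ell} := \begin{bmatrix} m_0 \\ G_1 u_1 \\ y_1 \\ \vdots \\ G_\ell u_\ell \\ y_\ell \\ G_{\ell+1} u_{\ell+1} \\ \vdots \\ G_k u_k \end{bmatrix}, \qquad
z_k := \begin{bmatrix} x_0 \\ \vdots \\ x_k \end{bmatrix}, \qquad
\zeta_{k|\ell} := \begin{bmatrix} -\xi_0 \\ -\xi_1 \\ +\eta_1 \\ \vdots \\ -\xi_\ell \\ +\eta_\ell \\ -\xi_{\ell+1} \\ \vdots \\ -\xi_k \end{bmatrix},$$

and $A_{k|\ell}$ is

$$A_{k|\ell} := \begin{bmatrix}
I & 0 & 0 & \cdots & & & & 0 \\
-F_1 & I & 0 & & & & & \vdots \\
0 & H_1 & 0 & & & & & \\
0 & -F_2 & I & & & & & \\
0 & 0 & H_2 & & & & & \\
\vdots & & \ddots & \ddots & & & & \\
 & & & -F_\ell & I & & & \\
 & & & 0 & H_\ell & & & \\
 & & & & -F_{\ell+1} & I & & \\
 & & & & & \ddots & \ddots & \\
0 & \cdots & & & & & -F_k & I
\end{bmatrix}.$$

Note that the noise vector $\zeta_{k|\ell}$ is $\mathcal{X}^{k+1} \times \mathcal{Y}^\ell$-valued and has mean zero and
block-diagonal positive-definite precision operator (inverse covariance) $W_{k|\ell}$
given in block form by

$$W_{k|\ell} := \operatorname{diag}\bigl( Q_0^{-1}, Q_1^{-1}, R_1^{-1}, \dots, Q_\ell^{-1}, R_\ell^{-1}, Q_{\ell+1}^{-1}, \dots, Q_k^{-1} \bigr).$$

By the Gauss-Markov theorem (Theorem 6.2), the best linear unbiased
estimate $\hat{z}_{k|\ell} = [\hat{x}_{0|\ell}, \dots, \hat{x}_{k|\ell}]^*$ of $z_k$ satisfies

$$\hat{z}_{k|\ell} \in \operatorname*{arg\,min}_{z_k \in \mathcal{X}^{k+1}} J_{k|\ell}(z_k), \qquad
J_{k|\ell}(z_k) := \frac{1}{2} \bigl\| A_{k|\ell} z_k - b_{k|\ell} \bigr\|_{W_{k|\ell}}^2, \tag{7.4}$$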
and by Lemma 4.27 is the solution of the normal equations

$$A_{k|\ell}^* W_{k|\ell} A_{k|\ell} \hat{z}_{k|\ell} = A_{k|\ell}^* W_{k|\ell} b_{k|\ell}.$$

By Exercise 7.1, it follows from the assumptions made above that these
normal equations have a unique solution

$$\hat{z}_{k|\ell} = \bigl( A_{k|\ell}^* W_{k|\ell} A_{k|\ell} \bigr)^{-1} A_{k|\ell}^* W_{k|\ell} b_{k|\ell}. \tag{7.5}$$

By Theorem 6.2 and Remark 6.3, $\mathbb{E}[\hat{z}_{k|\ell}] = z_k$ and the covariance operator
of the estimate $\hat{z}_{k|\ell}$ is $(A_{k|\ell}^* W_{k|\ell} A_{k|\ell})^{-1}$; note that this covariance operator is
exactly the inverse of the Hessian of the quadratic form $J_{k|\ell}$.

Since a Gaussian measure is characterized by its mean and covariance, a
Bayesian statistician forming a Gaussian model for the process (7.1)-(7.2)
would say that the state history $z_k = (x_0, \dots, x_k)$, conditioned upon the
control and observation data $b_{k|\ell}$, is the Gaussian random variable with
distribution $\mathcal{N}\bigl( \hat{z}_{k|\ell}, (A_{k|\ell}^* W_{k|\ell} A_{k|\ell})^{-1} \bigr)$.

Note that, since $W_{k|\ell}$ is block diagonal, $J_{k|\ell}$ can be written as

$$J_{k|\ell}(z_k) = \frac{1}{2} \| x_0 - m_0 \|_{Q_0^{-1}}^2
+ \frac{1}{2} \sum_{i=1}^{\ell} \| y_i - H_i x_i \|_{R_i^{-1}}^2
+ \frac{1}{2} \sum_{i=1}^{k} \| x_i - F_i x_{i-1} - G_i u_i \|_{Q_i^{-1}}^2. \tag{7.6}$$

An expansion of this type will prove very useful in derivation of the linear
Kalman filter in the next section.
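
The batch estimate (7.5) can be assembled explicitly in a small case. The following is a minimal sketch under simplifying assumptions that are not in the text: a scalar state, time-independent $F$, $Q$, $H$, $R$, no control (so $G_i u_i = 0$), and $k = \ell$; the function name `batch_state_estimate` is illustrative. It builds $A_{k|\ell}$, $b_{k|\ell}$ and $W_{k|\ell}$ row by row and solves the normal equations.

```python
import numpy as np

def batch_state_estimate(m0, q0, f, q, h, r, ys):
    """Best linear unbiased estimate of (x_0, ..., x_k) via the normal equations.

    Assumed scalar model (hypothetical): x_i = f x_{i-1} + xi_i, y_i = h x_i + eta_i,
    with Var(xi_0) = q0, Var(xi_i) = q, Var(eta_i) = r, and k = len(ys).
    """
    k = len(ys)
    e = np.eye(k + 1)                        # e[i] selects x_i from the state history
    rows, rhs, precisions = [e[0]], [m0], [1.0 / q0]
    for i in range(1, k + 1):
        rows.append(e[i] - f * e[i - 1])     # dynamics row: 0 = -f x_{i-1} + x_i - xi_i
        rhs.append(0.0)
        precisions.append(1.0 / q)
        rows.append(h * e[i])                # observation row: y_i = h x_i + eta_i
        rhs.append(ys[i - 1])
        precisions.append(1.0 / r)
    A, b, W = np.array(rows), np.array(rhs), np.diag(precisions)
    hessian = A.T @ W @ A                    # Hessian of the quadratic form J
    z_hat = np.linalg.solve(hessian, A.T @ W @ b)
    return z_hat, np.linalg.inv(hessian)     # estimate and its covariance

z_hat, cov = batch_state_estimate(m0=1.0, q0=2.0, f=0.9, q=0.5, h=1.0, r=0.25,
                                  ys=[1.3, 0.7, 1.1])
print(z_hat)
```

Every new observation enlarges the system being solved, which is the cost issue that motivates the sequential filter of the next section.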


7.2 Linear Kalman Filter 

We now consider the state estimation problem in the common practical
situation that $k = \ell$. Why is the state estimate (7.5) not the end of the story?
For one thing, there is an issue of immediacy: one does not want to have to
wait for observation $y_{1000}$ to come in before estimating states $x_1, \dots, x_{999}$ as
well as $x_{1000}$, in particular because the choice of the control $u_{k+1}$ typically
depends upon the estimate of $x_k$; what one wants is to estimate $x_k$ upon
observing $y_k$. However, there is also an issue of computational cost, and hence
computation time: the solution of the least squares problem

$$\min_{x \in \mathbb{K}^n} \| A x - b \|_{\mathbb{K}^m}^2,$$

where $A \in \mathbb{K}^{m \times n}$, at least by direct methods such as solving the normal
equations or QR factorization, requires of the order of $m n^2$ floating-point
operations. Hence, calculation of the state estimate $\hat{z}_k$ by direct solution of
(7.5) takes of the order of

$$\bigl( (k + 1)(\dim \mathcal{X}) + \ell (\dim \mathcal{Y}) \bigr) \bigl( (k + 1) \dim \mathcal{X} \bigr)^2$$

operations. It is clearly impractical to work with a state estimation scheme
with a computational cost that increases cubically with the number of time
steps to be considered. The idea of filtering is to break the state estimation
problem down into a sequence of estimation problems that can be solved with
constant computational cost per time step, as each observation comes in.

The two-step linear Kalman filter (LKF) is an iterative¹ method for
constructing the best linear unbiased estimate $\hat{x}_{k|k}$ (with covariance operator
$C_{k|k}$) of $x_k$ in terms of the previous state estimate $\hat{x}_{k-1|k-1}$ and the data $u_k$
and $y_k$. It is called the two-step filter because the process of updating the state
estimate $(\hat{x}_{k-1|k-1}, C_{k-1|k-1})$ for time $t_{k-1}$ into the estimate $(\hat{x}_{k|k}, C_{k|k})$ for
$t_k$ is split into two steps (which can, of course, be algebraically unified into a
single step):

• the prediction step uses the dynamics but not the observation $y_k$ to
update $(\hat{x}_{k-1|k-1}, C_{k-1|k-1})$ into an estimate $(\hat{x}_{k|k-1}, C_{k|k-1})$ for the
state at time $t_k$;

• the correction step uses the observation $y_k$ but not the dynamics to
update $(\hat{x}_{k|k-1}, C_{k|k-1})$ into a new estimate $(\hat{x}_{k|k}, C_{k|k})$.

It is possible to show, though we will not do so here, that the computational
cost of each iteration of the LKF is at most a constant times the computational
cost of matrix-matrix multiplication.

The literature contains many derivations of the Kalman filter, but there 
are two especially attractive viewpoints. One is to view the LKF purely as 
a statement about the linear push-forward and subsequent conditioning of 
Gaussian measures; in this paradigm, from a Bayesian point of view, the 
LKF is an exact description of the evolving system and its associated
uncertainties, under the prior assumption that everything is Gaussian. Another
point of view is to derive the LKF in a variational fashion, forming a sequence
of Gauss-Markov-like estimation problems, and exploiting the additive
decomposition (7.6) of the quadratic form that must be minimized to obtain
the best linear unbiased estimator. One advantage of the variational point 
of view is that it forms the basis of iterative methods for non-linear filtering 
problems, in which Gaussian descriptions are only approximate. 

Initialization. We begin by initializing the state estimate as

$$(\hat{x}_{0|0}, C_{0|0}) := (m_0, Q_0).$$

In practice, one does not usually know the initial state of the system, or
the concept of an ‘initial state’ is somewhat arbitrary (e.g. when tracking an
astronomical body such as an asteroid). In such cases, it is common to use a
placeholder value for the mean $\hat{x}_{0|0}$ and an extremely large covariance $C_{0|0}$
that reflects great ignorance/open-mindedness about the system’s state at
the start of the filtering process.

¹ That is, the LKF is iterative in the sense that it performs the state estimation sequentially
with respect to the time steps; each individual update, however, is an elementary linear
algebra problem, which could itself be solved either directly or iteratively.

Prediction: Push-Forward Method. The prediction step of the LKF is
simply a linear push-forward of the Gaussian measure $\mathcal{N}(\hat{x}_{k-1|k-1}, C_{k-1|k-1})$
through the linear dynamical model (7.1). By Exercise 2.5, this push-forward
measure is $\mathcal{N}(\hat{x}_{k|k-1}, C_{k|k-1})$, where

$$\hat{x}_{k|k-1} := F_k \hat{x}_{k-1|k-1} + G_k u_k, \tag{7.7}$$

$$C_{k|k-1} := F_k C_{k-1|k-1} F_k^* + Q_k. \tag{7.8}$$

These two updates comprise the prediction step of the Kalman filter, the
result of which can be seen as a Bayesian prior for the next step of the
Kalman filter.
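
In finite dimensions the prediction step (7.7)-(7.8) is two lines of linear algebra. The sketch below assumes real matrices stored as NumPy arrays, so that the adjoint $F_k^*$ is the transpose; the function name `lkf_predict` and the numerical values are illustrative choices, not taken from the text.

```python
import numpy as np

def lkf_predict(x_prev, C_prev, F, Q, G=None, u=None):
    """Push the estimate (x_prev, C_prev) forward through x_k = F x_{k-1} + G u_k + xi_k."""
    x_pred = F @ x_prev                      # (7.7) without control
    if G is not None and u is not None:
        x_pred = x_pred + G @ u              # add the control contribution G_k u_k
    C_pred = F @ C_prev @ F.T + Q            # (7.8)
    return x_pred, C_pred

# Illustrative two-dimensional example (position and velocity), no control.
F = np.array([[1.0, 0.1], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
x_pred, C_pred = lkf_predict(np.array([1.0, 2.0]), np.eye(2), F, Q)
print(x_pred)
print(C_pred)
```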


Prediction: Variational Method. The prediction step can also be
characterized in a variational fashion: $\hat{x}_{k|k-1}$ should be the best linear unbiased
estimate of $x_k$ given $y_1, \dots, y_{k-1}$, i.e. it should minimize $J_{k|k-1}$. Recall the
notation from Section 7.1: a state history $z_k \in \mathcal{X}^{k+1}$ is a $(k+1)$-tuple of states
$(x_0, \dots, x_k)$. Let

$$\widetilde{F}_k := \begin{bmatrix} 0 & \cdots & 0 & F_k \end{bmatrix},$$

with $F_k$ in the $k$th block, i.e. the block corresponding to $x_{k-1}$, so that
$\widetilde{F}_k z_{k-1} = F_k x_{k-1}$. By (7.6),

$$J_{k|k-1}(z_k) = J_{k-1|k-1}(z_{k-1}) + \frac{1}{2} \bigl\| x_k - \widetilde{F}_k z_{k-1} - G_k u_k \bigr\|_{Q_k^{-1}}^2.$$

The gradient and Hessian of $J_{k|k-1}$ are given (in block form, splitting $z_k$ into
$z_{k-1}$ and $x_k$ components) by

$$\nabla J_{k|k-1}(z_k) = \begin{bmatrix}
\nabla J_{k-1|k-1}(z_{k-1}) + \widetilde{F}_k^* Q_k^{-1} (\widetilde{F}_k z_{k-1} + G_k u_k - x_k) \\
- Q_k^{-1} (\widetilde{F}_k z_{k-1} + G_k u_k - x_k)
\end{bmatrix},$$

$$\nabla^2 J_{k|k-1}(z_k) = \begin{bmatrix}
\nabla^2 J_{k-1|k-1}(z_{k-1}) + \widetilde{F}_k^* Q_k^{-1} \widetilde{F}_k & - \widetilde{F}_k^* Q_k^{-1} \\
- Q_k^{-1} \widetilde{F}_k & Q_k^{-1}
\end{bmatrix}.$$

It is readily verified that

$$\nabla J_{k|k-1}(z_k) = 0 \iff z_k = \hat{z}_{k|k-1} = (\hat{x}_{0|0}, \dots, \hat{x}_{k-1|k-1}, \hat{x}_{k|k-1}),$$

with $\hat{x}_{k|k-1}$ as in (7.7). We can use this $\hat{z}_{k|k-1}$ as the initial condition for (and,
indeed, fixed point of) a single iteration of the Newton algorithm, which
by Exercise 4.3 finds the minimum of $J_{k|k-1}$ in one step; if the dynamics
were nonlinear, $\hat{z}_{k|k-1}$ would still be a sensible initial condition for Newton
iteration to find the minimum of $J_{k|k-1}$, but might not be the minimizer.
The covariance of this Gauss-Markov estimator for $x_k$ is the bottom-right
block of $(\nabla^2 J_{k|k-1}(z_k))^{-1}$: by the inversion lemma (Exercise 2.7) and the
inductive assumption that the bottom-right ($x_{k-1}$) block of $(\nabla^2 J_{k-1|k-1})^{-1}$
is the covariance of the previous state estimate $\hat{x}_{k-1|k-1}$,

$$\begin{aligned}
C_{k|k-1} &= Q_k + \widetilde{F}_k \bigl( \nabla^2 J_{k-1|k-1}(\hat{z}_{k-1|k-1}) \bigr)^{-1} \widetilde{F}_k^* && \text{by Exercise 2.7} \\
&= Q_k + F_k C_{k-1|k-1} F_k^* && \text{by induction,}
\end{aligned}$$

just as in (7.8).

Correction: Conditioning Method. The next step is a correction step
(also known as the analysis or update step) that corrects the prior
distribution $\mathcal{N}(\hat{x}_{k|k-1}, C_{k|k-1})$ to a posterior distribution $\mathcal{N}(\hat{x}_{k|k}, C_{k|k})$ using
the observation $y_k$. The key insight here is to observe that $x_{k|k-1}$ and $y_k$ are
jointly normally distributed, and the observation equation (7.2) defines the
conditional distribution of $y_k$ given $x_{k|k-1} = x$ as $\mathcal{N}(H_k x, R_k)$, and hence

$$\bigl( y_k \,\big|\, x_{k|k-1} \sim \mathcal{N}(\hat{x}_{k|k-1}, C_{k|k-1}) \bigr) \sim \mathcal{N}\bigl( H_k \hat{x}_{k|k-1}, H_k C_{k|k-1} H_k^* + R_k \bigr).$$

The joint distribution of $x_{k|k-1}$ and $y_k$ is, in block form,

$$\begin{bmatrix} x_{k|k-1} \\ y_k \end{bmatrix} \sim \mathcal{N}\left(
\begin{bmatrix} \hat{x}_{k|k-1} \\ H_k \hat{x}_{k|k-1} \end{bmatrix},
\begin{bmatrix} C_{k|k-1} & C_{k|k-1} H_k^* \\ H_k C_{k|k-1} & H_k C_{k|k-1} H_k^* + R_k \end{bmatrix}
\right).$$

Theorem 2.54 on the conditioning of Gaussian measures now gives the
conditional distribution of $x_k$ given $y_k$ as $\mathcal{N}(\hat{x}_{k|k}, C_{k|k})$ with

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + C_{k|k-1} H_k^* S_k^{-1} (y_k - H_k \hat{x}_{k|k-1}) \tag{7.9}$$

and

$$C_{k|k} = C_{k|k-1} - C_{k|k-1} H_k^* S_k^{-1} H_k C_{k|k-1}, \tag{7.10}$$

where the self-adjoint and positive-definite operator $S_k \colon \mathcal{Y} \to \mathcal{Y}$ defined by

$$S_k := H_k C_{k|k-1} H_k^* + R_k \tag{7.11}$$

is known as the innovation covariance.
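
The conditioning formulas (7.9)-(7.11) translate directly into code in the finite-dimensional real case. The following is a minimal sketch in which the function name `lkf_correct` and the test values are illustrative assumptions; linear solves against $S_k$ are used instead of forming $S_k^{-1}$ explicitly, which is a common numerical preference rather than something prescribed by the text.

```python
import numpy as np

def lkf_correct(x_pred, C_pred, y, H, R):
    """Condition the prior N(x_pred, C_pred) on the observation y = H x + eta, eta ~ N(0, R)."""
    S = H @ C_pred @ H.T + R                             # innovation covariance (7.11)
    CHt = C_pred @ H.T
    x_post = x_pred + CHt @ np.linalg.solve(S, y - H @ x_pred)   # posterior mean (7.9)
    C_post = C_pred - CHt @ np.linalg.solve(S, CHt.T)            # posterior covariance (7.10)
    return x_post, C_post, S

# Illustrative example: observe the first component of a two-dimensional state.
H = np.array([[1.0, 0.0]])
R = np.array([[1.0]])
x_post, C_post, S = lkf_correct(np.zeros(2), np.eye(2), np.array([2.0]), H, R)
print(x_post)    # the observed component moves halfway towards the datum
print(C_post)    # only the observed component's variance is reduced
```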

Another expression for the posterior covariance $C_{k|k}$, or rather the posterior
precision $C_{k|k}^{-1}$, can be easily obtained by applying the Woodbury formula
(2.9) from Exercise 2.7 to (7.10):

$$\begin{aligned}
C_{k|k}^{-1} &= \bigl( C_{k|k-1} - C_{k|k-1} H_k^* S_k^{-1} H_k C_{k|k-1} \bigr)^{-1} \\
&= C_{k|k-1}^{-1} + H_k^* \bigl( S_k - H_k C_{k|k-1} C_{k|k-1}^{-1} C_{k|k-1} H_k^* \bigr)^{-1} H_k \\
&= C_{k|k-1}^{-1} + H_k^* \bigl( H_k C_{k|k-1} H_k^* + R_k - H_k C_{k|k-1} H_k^* \bigr)^{-1} H_k \\
&= C_{k|k-1}^{-1} + H_k^* R_k^{-1} H_k.
\end{aligned} \tag{7.12}$$

Application of this formula gives another useful expression for the posterior
mean $\hat{x}_{k|k}$:

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + C_{k|k} H_k^* R_k^{-1} (y_k - H_k \hat{x}_{k|k-1}). \tag{7.13}$$

To prove the equivalence of (7.9) and (7.13), it is enough to show that
$C_{k|k} H_k^* R_k^{-1} = C_{k|k-1} H_k^* S_k^{-1}$, and this follows easily after multiplying on the
left by $C_{k|k}^{-1}$, on the right by $S_k$, inserting (7.12) and (7.11), and simplifying
the resulting expressions.
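
For readers who want the intermediate algebra, the equivalence of (7.9) and (7.13) can be written out as follows; this is only the computation described in words above, made explicit. After multiplying the claimed identity on the left by $C_{k|k}^{-1}$ and on the right by $S_k$, it reads $H_k^* R_k^{-1} S_k = C_{k|k}^{-1} C_{k|k-1} H_k^*$, and both sides reduce to the same expression:

```latex
\begin{aligned}
H_k^* R_k^{-1} S_k
  &= H_k^* R_k^{-1} \bigl( H_k C_{k|k-1} H_k^* + R_k \bigr)
   && \text{by (7.11)} \\
  &= H_k^* R_k^{-1} H_k C_{k|k-1} H_k^* + H_k^* , \\[4pt]
C_{k|k}^{-1} C_{k|k-1} H_k^*
  &= \bigl( C_{k|k-1}^{-1} + H_k^* R_k^{-1} H_k \bigr) C_{k|k-1} H_k^*
   && \text{by (7.12)} \\
  &= H_k^* + H_k^* R_k^{-1} H_k C_{k|k-1} H_k^* .
\end{aligned}
```

Since $C_{k|k}^{-1}$ and $S_k$ are invertible, equality of these two expressions gives $C_{k|k} H_k^* R_k^{-1} = C_{k|k-1} H_k^* S_k^{-1}$.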


Correction: Variational Method. As with the prediction step, the
correction step can be characterized in a variational fashion: $\hat{x}_{k|k}$ should be the best
linear unbiased estimate of $x_k$ given $y_1, \dots, y_k$, i.e. it should minimize
$J_{k|k}$. Let

$$\widetilde{H}_k := \begin{bmatrix} 0 & \cdots & 0 & H_k \end{bmatrix},$$

with $H_k$ in the $(k+1)$st block, i.e. the block corresponding to $x_k$, so that
$\widetilde{H}_k z_k = H_k x_k$. By (7.6),

$$J_{k|k}(z_k) = J_{k|k-1}(z_k) + \frac{1}{2} \bigl\| \widetilde{H}_k z_k - y_k \bigr\|_{R_k^{-1}}^2.$$

The gradient and Hessian of $J_{k|k}$ are given by

$$\begin{aligned}
\nabla J_{k|k}(z_k) &= \nabla J_{k|k-1}(z_k) + \widetilde{H}_k^* R_k^{-1} (\widetilde{H}_k z_k - y_k) \\
&= \nabla J_{k|k-1}(z_k) + \widetilde{H}_k^* R_k^{-1} (H_k x_k - y_k), \\
\nabla^2 J_{k|k}(z_k) &= \nabla^2 J_{k|k-1}(z_k) + \widetilde{H}_k^* R_k^{-1} \widetilde{H}_k.
\end{aligned}$$

Note that, in block form, the Hessian of $J_{k|k}$ is that of $J_{k|k-1}$ plus a ‘rank
one update’:

$$\nabla^2 J_{k|k}(z_k) = \nabla^2 J_{k|k-1}(z_k) + \begin{bmatrix} 0 & 0 \\ 0 & H_k^* R_k^{-1} H_k \end{bmatrix}.$$

By Exercise 4.3, a single Newton iteration with any initial condition will
find the minimizer $\hat{x}_{k|k}$ of the quadratic form $J_{k|k}$. A good choice of initial
condition is

$$z_k = \hat{z}_{k|k-1} = (\hat{x}_{0|0}, \dots, \hat{x}_{k-1|k-1}, \hat{x}_{k|k-1}),$$

so that $\nabla J_{k|k-1}(z_k)$ vanishes and the bottom-right block of $\nabla^2 J_{k|k-1}(z_k)^{-1}$
is $C_{k|k-1}$.
The bottom-right ($x_k$) block of $(\nabla^2 J_{k|k}(z_k))^{-1}$, i.e. the covariance operator
$C_{k|k}$, can now be found by blockwise inversion. Observe that, when $A$, $B$, $C$
and $D + D'$ satisfy the conditions of the inversion lemma (Exercise 2.7), we
have

$$\begin{bmatrix} A & B \\ C & D + D' \end{bmatrix}^{-1}
= \begin{bmatrix} * & * \\ * & \bigl( D' + (D - C A^{-1} B) \bigr)^{-1} \end{bmatrix},$$

where $*$ denotes entries that are irrelevant for this discussion of bottom-right
blocks. Now apply this observation with

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \nabla^2 J_{k|k-1}(z_k) \quad \text{and} \quad D' = H_k^* R_k^{-1} H_k,$$

so that $D - C A^{-1} B = C_{k|k-1}^{-1}$. Therefore, with this choice of initial condition
for the Newton iteration, we see that $C_{k|k}$, which is the bottom-right block
of $(\nabla^2 J_{k|k}(z_k))^{-1}$, is $\bigl( C_{k|k-1}^{-1} + H_k^* R_k^{-1} H_k \bigr)^{-1}$, in accordance with (7.12).

The Kalman Gain. The correction step of the Kalman filter is often phrased
in terms of the Kalman gain $K_k \colon \mathcal{Y} \to \mathcal{X}$ defined by

$$K_k := C_{k|k-1} H_k^* S_k^{-1} = C_{k|k-1} H_k^* \bigl( H_k C_{k|k-1} H_k^* + R_k \bigr)^{-1}. \tag{7.14}$$

With this definition of $K_k$,

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (y_k - H_k \hat{x}_{k|k-1}), \tag{7.15}$$

$$C_{k|k} = (I - K_k H_k) C_{k|k-1} = C_{k|k-1} - K_k S_k K_k^*. \tag{7.16}$$

It is also common to refer to

$$\tilde{y}_k := y_k - H_k \hat{x}_{k|k-1}$$

as the innovation residual, so that

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{y}_k.$$

Thus, the role of the Kalman gain is to quantify how much of the innovation
residual should be used in correcting the predictive estimate $\hat{x}_{k|k-1}$. It is an
exercise in algebra to show that the first presentation of the correction step
(7.9)-(7.10) and the Kalman gain formulation (7.14)-(7.16) are the same.

Example 7.3 (LKF for a simple harmonic oscillator). Consider the simple
harmonic oscillator equation $\ddot{x}(t) = -\omega^2 x(t)$, with $\omega > 0$. Given a time step
$\Delta t > 0$, this system can be discretized in an energy-conserving way by the
semi-implicit Euler scheme

$$x_k = x_{k-1} + v_k \Delta t,$$
$$v_k = v_{k-1} - \omega^2 x_{k-1} \Delta t.$$
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Note that position, x, is updated using the already-updated value of 
velocity, v. The energy conservation property is very useful in practice, and 
has the added advantage that we can use a relatively large time step in Figure 
7.1 and thereby avoid cluttering the illustration. We initialize this oscillator 
with the initial conditions (#o,^o) = (1,0). 

Suppose that noisy measurements of the $x$-component of this oscillator are
made at each time step:

\[ y_k = x_k + \eta_k, \qquad \eta_k \sim \mathcal{N}(0, 1/2). \]

For illustrative purposes, we give the oscillator the initial position and velocity
$(x(0), v(0)) = (1, 0)$; note that this makes the observational errors almost of
the same order of magnitude as the amplitude of the oscillator. The LKF is
initialized with the erroneous estimate $(\hat{x}_{0|0}, \hat{v}_{0|0}) = (0, 0)$ and an extremely
conservative covariance of $C_{0|0} = 10^{10} I$. In this case, there is no need for
control terms, and we have $Q_k = 0$,

\[ F_k = \begin{bmatrix} 1 - \omega^2 \Delta t^2 & \Delta t \\ -\omega^2 \Delta t & 1 \end{bmatrix}, \qquad H_k = \begin{bmatrix} 1 & 0 \end{bmatrix}, \qquad R_k = \frac{1}{2} . \]

Here the first row of $F_k$ follows from substituting the velocity update into the position update, $x_k = (1 - \omega^2 \Delta t^2) x_{k-1} + v_{k-1} \Delta t$.

The results, for $\omega = 1$ and a relatively large time step $\Delta t$, are illustrated in Figure 7.1. The initially
huge covariance disappears within the first iteration of the algorithm, rapidly
producing effective estimates for the evolving position of the oscillator that
are significantly more accurate than the observed data alone.
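
The experiment of Example 7.3 can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind Figure 7.1: the function name `lkf_oscillator`, the random seed, the number of steps, and in particular the time step `dt = 0.1` are assumptions made here, not values taken from the text.

```python
import numpy as np


def lkf_oscillator(omega=1.0, dt=0.1, n_steps=200, obs_var=0.5, seed=0):
    """Run the LKF on the semi-implicit Euler oscillator of Example 7.3.

    Returns the absolute position errors of the filter and of the raw data.
    The defaults for dt, n_steps and seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Dynamics matrix obtained by substituting the v-update into the x-update.
    F = np.array([[1.0 - omega**2 * dt**2, dt],
                  [-omega**2 * dt, 1.0]])
    H = np.array([[1.0, 0.0]])      # only the position is observed
    R = np.array([[obs_var]])       # observation noise covariance
    Q = np.zeros((2, 2))            # no model noise in this example

    truth = np.array([1.0, 0.0])    # true initial state (x_0, v_0)
    mean = np.zeros(2)              # erroneous initial estimate
    cov = 1e10 * np.eye(2)          # extremely conservative covariance

    filter_err, data_err = [], []
    for _ in range(n_steps):
        truth = F @ truth
        y = H @ truth + rng.normal(0.0, np.sqrt(obs_var), size=1)

        # Prediction step.
        mean = F @ mean
        cov = F @ cov @ F.T + Q

        # Correction step in Kalman gain form; K = C H^T S^{-1} is obtained
        # from a linear solve rather than an explicit inverse.
        S = H @ cov @ H.T + R
        K = np.linalg.solve(S, H @ cov).T
        mean = mean + K @ (y - H @ mean)
        cov = (np.eye(2) - K @ H) @ cov

        filter_err.append(abs(mean[0] - truth[0]))
        data_err.append(abs(y[0] - truth[0]))
    return np.array(filter_err), np.array(data_err)


if __name__ == "__main__":
    f_err, d_err = lkf_oscillator()
    print("mean filter error:", f_err.mean())
    print("mean data error:  ", d_err.mean())
```

Because $Q_k = 0$ and the filter uses the exact dynamics, the filtered error in this sketch should shrink as more observations are assimilated, whereas the data error stays at the level of the observational noise.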


Continuous Time Linear Kalman Filters. The LKF can also be formulated
in continuous time, or in a hybrid form with continuous evolution but
discrete observations. For example, the hybrid LKF has the evolution and
observation equations

\[ \dot{x}(t) = F(t) x(t) + G(t) u(t) + w(t), \]
\[ y_k = H_k x_k + \eta_k, \]

where $x_k := x(t_k)$. The prediction equations are that $\hat{x}_{k|k-1}$ is the solution
at time $t_k$ of the initial value problem

\[ \frac{d\hat{x}}{dt}(t) = F(t) \hat{x}(t) + G(t) u(t), \]
\[ \hat{x}(t_{k-1}) = \hat{x}_{k-1|k-1}, \]

and that $C_{k|k-1}$ is the solution at time $t_k$ of the initial value problem

\[ \dot{C}(t) = F(t) C(t) + C(t) F(t)^* + Q(t), \]
\[ C(t_{k-1}) = C_{k-1|k-1} . \]
(a) The dashed curve shows the true evolution of position. The solid black curve
shows the filtered mean estimate of the position; the grey envelope shows the mean
± one standard deviation. The black crosses show the observed data.

(b) The filtered prediction errors $|\hat{x}_{k|k} - x_k|$ (solid curve) are consistently smaller
than the observation errors $|y_k - x_k|$ (dashed curve).

Fig. 7.1: The LKF applied to a simple harmonic oscillator, as in Example
7.3. Despite the comparatively large scatter in the observed data, as shown
in (a), and the large time step $\Delta t$, the LKF consistently provides
better estimates of the system's state than the data alone, as shown in (b).


The correction equations (in Kalman gain form) are as before:

\[ K_k = C_{k|k-1} H_k^* ( H_k C_{k|k-1} H_k^* + R_k )^{-1}, \]
\[ \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k ( y_k - H_k \hat{x}_{k|k-1} ), \]
\[ C_{k|k} = ( I - K_k H_k ) C_{k|k-1} . \]

The LKF with continuous time evolution and observation is known as the
Kalman-Bucy filter. The evolution and observation equations are

\[ \dot{x}(t) = F(t) x(t) + G(t) u(t) + w(t), \]
\[ y(t) = H(t) x(t) + v(t). \]

Notably, in the Kalman-Bucy filter, the distinction between prediction and
correction does not exist: the state estimate and its covariance evolve according to

\[ \frac{d\hat{x}}{dt}(t) = F(t) \hat{x}(t) + G(t) u(t) + K(t) \bigl( y(t) - H(t) \hat{x}(t) \bigr), \]
\[ \dot{C}(t) = F(t) C(t) + C(t) F(t)^* + Q(t) - K(t) R(t) K(t)^*, \]

where

\[ K(t) := C(t) H(t)^* R(t)^{-1} . \]


7.3 Extended Kalman Filter 

The extended Kalman filter (ExKF or EKF) is an extension of the Kalman
filter to nonlinear dynamical systems. In discrete time, the evolution and
observation equations are

\[ x_k = f_k(x_{k-1}, u_k) + \xi_k, \]
\[ y_k = h_k(x_k) + \eta_k, \]

where, as before, $x_k \in \mathcal{X}$ are the states, $u_k \in \mathcal{U}$ are the controls, $y_k \in \mathcal{Y}$
are the observations, $f_k \colon \mathcal{X} \times \mathcal{U} \to \mathcal{X}$ are the vector fields for the dynamics,
$h_k \colon \mathcal{X} \to \mathcal{Y}$ are the observation maps, and the noise processes $\xi_k$ and $\eta_k$
are uncorrelated with zero mean and positive-definite covariances $Q_k$ and $R_k$
respectively.

The classical derivation of the ExKF is to approximate the nonlinear 
evolution-observation equations with a linear system and then use the LKF 
on that linear system. In contrast to the LKF, the ExKF is neither the 
unbiased minimum mean-squared error estimator nor the minimum vari- 
ance unbiased estimator of the state; in fact, the ExKF is generally biased. 
However, the ExKF is the best linear unbiased estimator of the linearized 
dynamical system, which can often be a good approximation of the nonlinear 
system. As a result, how well the local linear dynamics match the nonlinear 
dynamics determines in large part how well the ExKF will perform. Indeed, 
when the dynamics are strongly nonlinear, all approximate Gaussian filters 
(including KF-like methods) perform badly, since the push-forward of the
previous state estimate (a Gaussian measure) by a strongly nonlinear map is 
poorly approximated by a Gaussian. 

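The approximate linearized system is obtained by first-order Taylor
expansion of $f_k$ about the previous estimated state $\hat{x}_{k-1|k-1}$ and $h_k$ about
$\hat{x}_{k|k-1}$:

\[ x_k \approx f_k( \hat{x}_{k-1|k-1}, u_k ) + \mathrm{D}f_k( \hat{x}_{k-1|k-1}, u_k ) ( x_{k-1} - \hat{x}_{k-1|k-1} ) + \xi_k, \]
\[ y_k \approx h_k( \hat{x}_{k|k-1} ) + \mathrm{D}h_k( \hat{x}_{k|k-1} ) ( x_k - \hat{x}_{k|k-1} ) + \eta_k . \]

Taking

\[ F_k := \mathrm{D}f_k( \hat{x}_{k-1|k-1}, u_k ), \]
\[ H_k := \mathrm{D}h_k( \hat{x}_{k|k-1} ), \]
\[ \tilde{u}_k := f_k( \hat{x}_{k-1|k-1}, u_k ) - F_k \hat{x}_{k-1|k-1}, \]
\[ \tilde{y}_k := h_k( \hat{x}_{k|k-1} ) - H_k \hat{x}_{k|k-1}, \]

the linearized system is

\[ x_k = F_k x_{k-1} + \tilde{u}_k + \xi_k, \]
\[ y_k = H_k x_k + \tilde{y}_k + \eta_k . \]

The terms $\tilde{u}_k$ and $\tilde{y}_k$ can be seen as spurious control forces and observations
respectively, induced by the errors involved in approximating $f_k$ and $h_k$ by
their derivatives. The ExKF is now obtained by applying the standard LKF
to this system, treating $\tilde{u}_k$ as the controls for the linear system and $y_k - \tilde{y}_k$
as the observations, to obtain

\[ \hat{x}_{k|k-1} = f_k( \hat{x}_{k-1|k-1}, u_k ), \tag{7.17} \]
\[ C_{k|k-1} = F_k C_{k-1|k-1} F_k^* + Q_k, \tag{7.18} \]
\[ C_{k|k} = ( C_{k|k-1}^{-1} + H_k^* R_k^{-1} H_k )^{-1}, \tag{7.19} \]
\[ \hat{x}_{k|k} = \hat{x}_{k|k-1} - C_{k|k} H_k^* R_k^{-1} \bigl( h_k( \hat{x}_{k|k-1} ) - y_k \bigr). \tag{7.20} \]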
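
One step of the ExKF can be sketched as follows. This is a hedged illustration and not a reference implementation: the function name `exkf_step` and its argument names are hypothetical, the user is assumed to supply the maps `f`, `h` and their derivatives `Df`, `Dh` as Python callables, and the correction is written in Kalman gain form, which for the linearized system is algebraically equivalent to the inverse-covariance form of the LKF correction.

```python
import numpy as np


def exkf_step(x_est, C_est, y, f, h, Df, Dh, Q, R, u=None):
    """One predict-correct cycle of an extended Kalman filter (sketch).

    f(x, u), h(x): nonlinear dynamics and observation maps (user supplied).
    Df(x, u), Dh(x): their Jacobian matrices (user supplied).
    """
    # Prediction: push the mean through f, the covariance through Df.
    F = Df(x_est, u)
    x_pred = f(x_est, u)
    C_pred = F @ C_est @ F.T + Q

    # Correction: linearize h about the predicted state.
    H = Dh(x_pred)
    S = H @ C_pred @ H.T + R
    K = np.linalg.solve(S, H @ C_pred).T      # K = C H^T S^{-1}
    x_new = x_pred + K @ (y - h(x_pred))
    C_new = (np.eye(len(x_est)) - K @ H) @ C_pred
    return x_new, C_new
```

When `f` and `h` are linear, this step reduces to the ordinary LKF step, which gives a convenient sanity check for an implementation.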



7.4 Ensemble Kalman Filter 

The EnKF is a Monte Carlo approximation of the Kalman filter that avoids
evolving the covariance operator of the state vector $x \in \mathcal{X}$, and thus eliminates
the computational costs associated with storing, multiplying and
inverting the matrix representation of this operator. These computational
costs can be huge: in applications such as weather forecasting, $\dim \mathcal{X}$ can
easily be of order $10^6$ to $10^9$. Instead, the EnKF uses an ensemble of $E \in \mathbb{N}$
state estimates $\hat{x}^{(e)} \in \mathcal{X}$, $e = 1, \dots, E$, arranged into a matrix

\[ X = [ \hat{x}^{(1)}, \dots, \hat{x}^{(E)} ] . \]

The columns of the matrix $X$ are the ensemble members.

Initialization. The ensemble is initialized by choosing the columns of $X_{0|0}$
to be $E$ independent draws from, say, $\mathcal{N}(m_0, Q_0)$. However, the ensemble
members are not generally independent except in the initial ensemble, since
every EnKF step ties them together, but all the calculations proceed as if
they actually were independent.


Prediction. The prediction step of the EnKF is straightforward: each column
$\hat{x}^{(e)}_{k-1|k-1}$ is evolved to $\hat{x}^{(e)}_{k|k-1}$ using the LKF prediction step (7.7)

\[ \hat{x}^{(e)}_{k|k-1} = F_k \hat{x}^{(e)}_{k-1|k-1} + G_k u_k, \]

or the ExKF prediction step (7.17)

\[ \hat{x}^{(e)}_{k|k-1} = f_k( \hat{x}^{(e)}_{k-1|k-1}, u_k ). \]

The matrix $X_{k|k-1}$ has as its columns the ensemble members $\hat{x}^{(e)}_{k|k-1}$ for
$e = 1, \dots, E$.

Correction. The correction step for the EnKF uses a trick called data replication:
the observation $y_k = H_k x_k + \eta_k$ is replicated into an $m \times E$ matrix

\[ D = [ d^{(1)}, \dots, d^{(E)} ], \qquad d^{(e)} := y_k + \eta^{(e)}, \qquad \eta^{(e)} \sim \mathcal{N}(0, R_k), \]

so that each column $d^{(e)}$ consists of the actual observed data vector $y_k \in \mathcal{Y}$
plus a perturbation that is an independent random draw from $\mathcal{N}(0, R_k)$. If
the columns of $X_{k|k-1}$ are a sample from the prior distribution, then the
columns of

\[ X_{k|k-1} + K_k ( D - H_k X_{k|k-1} ) \]

form a sample from the posterior probability distribution, in the sense of a
Bayesian prior (before data) and posterior (conditioned upon the data). The
EnKF approximates this sample by replacing the exact Kalman gain (7.14)

\[ K_k := C_{k|k-1} H_k^* ( H_k C_{k|k-1} H_k^* + R_k )^{-1}, \]

which involves the covariance $C_{k|k-1}$, which is not tracked in the EnKF, by
an approximate covariance. The empirical mean and empirical covariance of
$X_{k|k-1}$ are

\[ \langle X_{k|k-1} \rangle := \frac{1}{E} \sum_{e=1}^{E} \hat{x}^{(e)}_{k|k-1}, \]

\[ C^{E}_{k|k-1} := \frac{ ( X_{k|k-1} - \langle X_{k|k-1} \rangle ) ( X_{k|k-1} - \langle X_{k|k-1} \rangle )^* }{ E - 1 }, \]

where, by abuse of notation, $\langle X_{k|k-1} \rangle$ stands both for the vector in $\mathcal{X}$ that
is the arithmetic mean of the $E$ columns of the matrix $X_{k|k-1}$ and also for
the matrix in $\mathcal{X}^E$ that has that vector in every one of its $E$ columns. The
Kalman gain for the EnKF uses $C^{E}_{k|k-1}$ in place of $C_{k|k-1}$:

\[ K^{E}_k := C^{E}_{k|k-1} H_k^* ( H_k C^{E}_{k|k-1} H_k^* + R_k )^{-1}, \tag{7.21} \]
so that the correction step becomes

\[ X_{k|k} := X_{k|k-1} + K^{E}_k ( D - H_k X_{k|k-1} ). \tag{7.22} \]

One can also use sampling to dispense with $R_k$, and instead use the empirical
covariance of the replicated data,

\[ \frac{ ( D - \langle D \rangle ) ( D - \langle D \rangle )^* }{ E - 1 } . \]

Note, however, that the empirical covariance matrix is typically rank-deficient
(in practical applications, there are usually many more state variables than
ensemble members), in which case the inverse in (7.21) may fail to exist; in
such situations, a pseudo-inverse may be used.
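
The correction step (7.21)-(7.22) with data replication can be sketched as below. The function name `enkf_correct` and its signature are hypothetical choices made for this illustration; the sketch assumes that $R_k$ is positive definite, so that the matrix being inverted in (7.21) is invertible even when the empirical covariance is rank-deficient, and it does not cover the pseudo-inverse variant.

```python
import numpy as np


def enkf_correct(X_pred, y, H, R, rng):
    """EnKF correction with perturbed (replicated) observations (sketch).

    X_pred: (n, E) matrix whose columns are the predicted ensemble members.
    y: (m,) observed data; H: (m, n) observation matrix; R: (m, m) covariance.
    rng: a numpy random Generator used to draw the perturbations.
    """
    n, E = X_pred.shape
    m = len(y)
    # Data replication: each column is y plus an independent N(0, R) draw.
    D = y[:, None] + rng.multivariate_normal(np.zeros(m), R, size=E).T

    # Empirical covariance of the predicted ensemble.
    A = X_pred - X_pred.mean(axis=1, keepdims=True)
    C_emp = A @ A.T / (E - 1)

    # Ensemble Kalman gain, computed with a linear solve, not an inverse.
    S = H @ C_emp @ H.T + R
    K = np.linalg.solve(S, H @ C_emp).T

    return X_pred + K @ (D - H @ X_pred)
```

As a sanity check, for a scalar prior $\mathcal{N}(0, 1)$, observation $y = 1$ and $R = 1$, the exact posterior is $\mathcal{N}(1/2, 1/2)$, and a large corrected ensemble should have approximately that mean and variance.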

Remark 7.4. Even when the matrices involved are positive-definite, instead 
of computing the inverse of a matrix and multiplying by it, it is much better 
(several times cheaper and also more accurate) to compute the Cholesky 
decomposition of the matrix and treat the multiplication by the inverse as 
solution of a system of linear equations. This is a general point relevant to 
the implementation of all KF-like methods. 
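
The point of Remark 7.4 can be illustrated with a short sketch. The helper name `apply_inverse_cholesky` is hypothetical, and for brevity the two triangular systems are passed to the general-purpose `numpy.linalg.solve`, which does not exploit the triangular structure; a production code would use a dedicated triangular solver instead.

```python
import numpy as np


def apply_inverse_cholesky(S, B):
    """Return S^{-1} B for symmetric positive-definite S without forming S^{-1}."""
    L = np.linalg.cholesky(S)        # S = L L^T with L lower triangular
    Z = np.linalg.solve(L, B)        # forward substitution: L Z = B
    return np.linalg.solve(L.T, Z)   # back substitution:    L^T X = Z


def kalman_gain(C_pred, H, R):
    """Kalman gain K = C H^T S^{-1}, using K^T = S^{-1} (H C) for symmetric C, S."""
    S = H @ C_pred @ H.T + R
    return apply_inverse_cholesky(S, H @ C_pred).T
```

The gain is obtained from its transpose because $S_k$ and $C_{k|k-1}$ are symmetric, so that $K_k^* = S_k^{-1} H_k C_{k|k-1}$ is the solution of a linear system with matrix $S_k$.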

Remark 7.5. Filtering methods, and in particular the EnKF, can be used 
to provide approximate solutions to static inverse problems. The idea is that, 
for a static problem, the filtering distribution will converge as the number of 
iterations (‘algorithmic time’) tends to infinity, and that the limiting filtering 
distribution is the posterior for the original inverse problem. Of course, such 
arguments depend crucially upon the asymptotic properties of the filtering 
scheme; under suitable assumptions, the forward operator for the error can 
be shown to be a contraction, which yields the desired convergence. See, e.g., 
Iglesias et al. (2013) for further details and discussion. 


7.5 Bibliography 

The original presentation of the Kalman (Kalman, 1960) and Kalman-Bucy 
(Kalman and Bucy, 1961) filters was in the context of signal processing, and 
encountered some initial resistance from the engineering community, as related
in the article of Humpherys et al. (2012). Filtering is now fully accepted
in applications communities and has a sound algorithmic and theoretical
base; for a stochastic processes point of view on filtering, see, e.g., the books
of Jazwinski (1970) and Øksendal (2003, Chapter 6). Boutayeb et al. (1997)
and Ljung (1979) discuss the asymptotic properties of Kalman filters. 

The EnKF was introduced by Evensen (2009). See Kelly et al. (2014) for 
discussion of the well-posedness and accuracy of the EnKF, and Iglesias et al. 
(2013) for applications of the EnKF to static inverse problems. The variational
derivation of the Kalman filter given here is based on the one given by
Humpherys et al. (2012), which is also the source for Exercise 7.7. 

A thorough treatment of probabilistic forecasting using data assimila- 
tion and filtering is given by Reich and Cotter (2015). The article of Apte 
et al. (2008) provides a mathematical overview of data assimilation, with an 
emphasis on connecting the optimization approaches common in the data 
assimilation community (e.g. 3D-Var, 4D-Var, and weak constraint 4D-Var) 
to their Bayesian statistical analogues; this paper also illustrates some of the 
shortcomings of the EnKF. In another paper, Apte et al. (2007) also provide 
a treatment of non-Gaussian data assimilation. 


7.6 Exercises 

Exercise 7.1. Verify that the normal equations for the state estimation 
problem (7.4) have a unique solution. 

Exercise 7.2 (Fading memory). In the LKF, the current state variable is 
updated as the latest inputs and measurements become known, but the esti- 
mation is based on the least squares solution of all the previous states where 
all measurements are weighted according to their covariance. One can also 
use an estimator that discounts the error in older measurements leading to 
a greater emphasis on recent observations, which is particularly useful in 
situations where there is some modelling error in the system. 

To do this, consider the objective function 

\[ J_{k|k}(z_k) := \frac{1}{2} \| x_0 - m_0 \|_{Q_0^{-1}}^2 + \frac{1}{2} \sum_{i=1}^{k} \lambda^{k-i} \| y_i - H_i x_i \|_{R_i^{-1}}^2 + \frac{1}{2} \sum_{i=1}^{k} \lambda^{k-i} \| x_i - F_i x_{i-1} - G_i u_i \|_{Q_i^{-1}}^2 , \]

where the parameter $\lambda \in [0, 1]$ is called the forgetting factor; note that the
standard LKF is the case $\lambda = 1$, and the objective function increasingly
relies upon recent measurements as $\lambda \to 0$. Find a recursive expression for
the objective function and follow the steps in the variational derivation
of the usual LKF to derive the LKF with fading memory $\lambda$.

Exercise 7.3. Write the prediction and correction equations (7.17)-(7.20) 
for the ExKF in terms of the Kalman gain. 

Exercise 7.4. Use the ExKF to perform position and velocity estimation 
for the Van der Pol oscillator 


\[ \ddot{x}(t) - \mu ( 1 - x(t)^2 ) \dot{x}(t) + \omega^2 x(t) = 0, \]

with natural frequency $\omega > 0$ and damping $\mu \geq 0$, given noisy observations
of the position of the oscillator. (Note that $\mu = 0$ is the simple harmonic
oscillator of Example 7.3.)

Exercise 7.5. Building on Example 7.3 and Exercise 7.4, investigate the 
robustness of the Kalman filter to the forward model being ‘wrong’. Generate 
synthetic data using the Van der Pol oscillator, but assimilate these data using 
the LKF for the simple harmonic oscillator with a different value of $\omega$.

Exercise 7.6. Filtering can also be used to estimate model parameters, 
not just states. Consider the oscillator example from Example 7.3, but with 
an augmented state $(x, v, \omega)$. Write down the forward model, which is no
longer linear. Generate synthetic position data using your choice of $\omega$, then
assimilate these data using the ExKF with an initial estimate for $(x_0, v_0, \omega)$
of large covariance. Perform the same exercise for the Van der Pol oscillator 
from Exercise 7.4. 

Exercise 7.7. This exercise considers the ExKF with noisy linear dynamics 
and noisy non-linear observations with the aim of trajectory estimation for a 
projectile. For simplicity, we will work in a two-dimensional setting, so that 
ground level is the line $x_2 = 0$, the Earth is the half-space $x_2 \leq 0$, and
gravity acts in the $(0, -1)$ direction, the acceleration due to gravity being
$g = 9.807$ N/kg. Suppose that the projectile is launched at time $t_0$ from
$x_0 := (0, 0)$ m with initial velocity $v_0 := (300, 600)$ m/s.

(a) Suppose that at time $t_k = k \Delta t$, $\Delta t = 10^{-1}$ s, the projectile has position
$x_k \in \mathbb{R}^2$ and velocity $v_k \in \mathbb{R}^2$; let $X_k := (x_k, v_k) \in \mathbb{R}^4$. Write down
a discrete-time forward dynamical model² $F_k \in \mathbb{R}^{4 \times 4}$ that maps $X_k$ to
$X_{k+1}$ in terms of the time step $\Delta t$, and $g$. Suppose that the projectile
has drag coefficient $b = 10^{-4}$ (i.e. the effect of drag is $\dot{v} = -bv$). Suppose
also that the wind velocity at every time and place is horizontal, and
is given by mutually independent Gaussian random variables with mean
10 m/s and standard deviation 5 m/s. Evolve the system forward through
1200 time steps, and save this synthetic data.

(b) Suppose that a radar site, located at a ground-level observation post
$o = (30, 0)$ km, makes measurements of the projectile's position $x$ (but
not the velocity $v$) in polar coordinates centred on $o$, i.e. an angle of
elevation $\phi \in [0^\circ, 90^\circ]$ from the ground level, and a radial straight-line
distance $r \geq 0$ from $o$ to $x$. Write down the observation function $h \colon X \mapsto
y := (\phi, r)$, and calculate the derivative matrix $H = \mathrm{D}h$ of $h$.

(c) Assume that observation errors in $(\phi, r)$ coordinates are normally distributed
with mean zero, independent errors in the $\phi$ and $r$ directions,
and standard deviations $5^\circ$ and 500 m respectively. Using the synthetic
trajectory calculated above, calculate synthetic observational data for
times $t_k$ with $400 \leq k \leq 600$.


² There are many choices for this discrete-time model: each corresponds to a choice of
numerical integration scheme for the underlying continuous-time ODE.

(d) Use the ExKF to assimilate these data and produce filtered estimates
$\hat{X}_{k|k}$ of $X_k = (x_k, v_k)$. Use the observation $(\phi_{400}, r_{400})$ to initialize the
position estimate with a very large covariance matrix of your choice;
make and justify a similar choice for the initialization of the velocity estimate.
Compare and contrast the true trajectory, the observations, and
the filtered position estimates. On appropriately scaled axes, plot norms
of your position covariance matrices $C_{k|k}$ and the errors (i.e. the differences
between the synthetic 'true' trajectory and the filtered estimate).
Produce similar plots for the true velocity and filtered velocity estimates,
and comment on both sets of plots.

(e) Extend your predictions both forward and backward in time to produce 
filtered estimates of the time and point of impact, and also the time 
and point of launch. To give an idea of how quickly the filter acquires 
confidence about these events, produce plots of the estimated launch and 
impact points with the mean ± standard deviation on the vertical axis 
and time (i.e. observation number) on the horizontal axis. 

Exercise 7.8. Consider, as a paradigmatic example of a nonlinear — and, 
indeed, chaotic — dynamical system, the Lorenz 63 ODE system (Lorenz, 
1963; Sparrow, 1982): 


\[ \dot{x}(t) = \sigma ( y(t) - x(t) ), \]
\[ \dot{y}(t) = x(t) ( \rho - z(t) ) - y(t), \]
\[ \dot{z}(t) = x(t) y(t) - \beta z(t), \]

with the usual parameter values $\sigma = 10$, $\beta = 8/3$, and $\rho = 28$.

(a) Choose an initial condition for this system, then initialize an ensemble 
of E = 1000 Gaussian perturbations of this initial condition. Evolve this 
ensemble forward in time using a numerical ODE solver. Plot histograms 
of the projections of the ensemble at time t > 0 onto the x-, y-, and z-axes 
to gain an impression of when the ensemble ceases to be Gaussian. 

(b) Apply the EnKF to estimate the evolution of the Lorenz 63 system, given 
noisy observations of the state. Comment on the accuracy of the EnKF 
predictions, particularly during the early phase when the dynamics are 
almost linear and preserve the Gaussian nature of the ensemble, and over 
longer times when the Gaussian nature breaks down. 


Chapter 8 

Orthogonal Polynomials and 
Applications 


Although our intellect always longs for clarity 
and certainty, our nature often finds uncer- 
tainty fascinating. 


On War 
Karl von Clausewitz 


Orthogonal polynomials are an important example of orthogonal decom- 
positions of Hilbert spaces. They are also of great practical importance: 
they play a central role in numerical integration using quadrature rules 
(Chapter 9) and approximation theory; in the context of UQ, they are also 
a foundational tool in polynomial chaos expansions (Chapter 11). 

There are multiple equivalent characterizations of orthogonal polynomials 
via their three-term recurrence relations, via differential operators, and other 
properties; however, since the primary use of orthogonal polynomials in UQ 
applications is to provide an orthogonal basis of a probability space, here 
$L^2$-orthogonality is taken as the primary definition, and the spectral
properties then follow as consequences. 

As well as introducing the theory of orthogonal polynomials, this chapter 
also discusses their applications to polynomial interpolation and approxima- 
tion. There are many other interpolation and approximation schemes beyond 
those based on polynomials — notable examples being splines, radial basis 
functions, and the Gaussian processes of Chapter 13 — but this chapter 
focusses on the polynomial case as a prototypical one. 

In this chapter, $\mathcal{N} := \mathbb{N}_0$ or $\{0, 1, \dots, N\}$ for some $N \in \mathbb{N}_0$. For simplicity,
we work over $\mathbb{R}$ instead of $\mathbb{C}$, and so the $L^2$ inner product is a symmetric
bilinear form rather than a conjugate-symmetric sesquilinear form.


(c) Springer International Publishing Switzerland 2015 

T.J. Sullivan, Introduction to Uncertainty Quantification , Texts 

in Applied Mathematics 63, DOI 10.1007/978-3-319-23395-6-8 


8.1 Basic Definitions and Properties 


Recall that a real polynomial is a function $p \colon \mathbb{R} \to \mathbb{R}$ of the form

\[ p(x) = c_0 + c_1 x + \dots + c_{n-1} x^{n-1} + c_n x^n, \]

where the coefficients $c_0, \dots, c_n \in \mathbb{R}$ are scalars. The greatest $n \in \mathbb{N}_0$ for
which $c_n \neq 0$ is called the degree of $p$, $\deg(p)$; sometimes it is convenient to
regard the zero polynomial as having degree $-1$. If $\deg(p) = n$ and $c_n = 1$,
then $p$ is said to be monic. The space of all (real) polynomials in $x$ is denoted
$\mathfrak{P}$, and the space of polynomials of degree at most $n$ is denoted $\mathfrak{P}_{\leq n}$.

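Definition 8.1. Let $\mu$ be a non-negative measure on $\mathbb{R}$. A family of polynomials
$\mathcal{Q} = \{ q_n \mid n \in \mathcal{N} \} \subseteq \mathfrak{P}$ is called an orthogonal system of polynomials
if, for each $n \in \mathcal{N}$, $\deg(q_n) = n$, $q_n \in L^2(\mathbb{R}, \mu)$, and

\[ \langle q_m, q_n \rangle_{L^2(\mu)} := \int_{\mathbb{R}} q_m(x) q_n(x) \, \mathrm{d}\mu(x) = 0 \iff m, n \in \mathcal{N} \text{ are distinct.} \]

That is, $\langle q_m, q_n \rangle_{L^2(\mu)} = \gamma_n \delta_{mn}$ for some constants

\[ \gamma_n := \| q_n \|_{L^2(\mu)}^2 = \int_{\mathbb{R}} q_n^2 \, \mathrm{d}\mu, \]

called the normalization constants of the system $\mathcal{Q}$. To avoid complications
later on, we require that the normalization constants are all strictly positive.
If $\gamma_n = 1$ for all $n \in \mathcal{N}$, then $\mathcal{Q}$ is an orthonormal system.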

In other words, a system of orthogonal polynomials is nothing but a collection
of non-trivial orthogonal elements of the Hilbert space $L^2(\mathbb{R}, \mu)$ that
happen to be polynomials, with some natural conditions on the degrees of the
polynomials. Note that, given $\mu$, orthogonal (resp. orthonormal) polynomials
for $\mu$ can be found inductively by using the Gram-Schmidt orthogonalization
(resp. orthonormalization) procedure on the monomials $1, x, x^2, \dots$. In practice,
however, the Gram-Schmidt procedure is numerically unstable, so it is
more common to generate orthogonal polynomials by other means, e.g. the
three-term recurrence relation (Theorem 8.9).

Example 8.2. (a) The Legendre polynomials $\mathrm{Le}_n$ (also commonly denoted
by $P_n$ in the literature), indexed by $n \in \mathbb{N}_0$, are orthogonal polynomials
for uniform measure on $[-1, 1]$:

\[ \int_{-1}^{1} \mathrm{Le}_m(x) \mathrm{Le}_n(x) \, \mathrm{d}x = \frac{2}{2n + 1} \delta_{mn} . \]

(b) The Legendre polynomials arise as the special case $\alpha = \beta = 0$ of the
Jacobi polynomials $P_n^{(\alpha, \beta)}$, defined for $\alpha, \beta > -1$ and indexed by $n \in \mathbb{N}_0$.
The Jacobi polynomials are orthogonal polynomials for the beta distribution
$(1 - x)^\alpha (1 + x)^\beta \, \mathrm{d}x$ on $[-1, 1]$:

\[ \int_{-1}^{1} P_m^{(\alpha, \beta)}(x) P_n^{(\alpha, \beta)}(x) (1 - x)^\alpha (1 + x)^\beta \, \mathrm{d}x = \frac{ 2^{\alpha + \beta + 1} \, \Gamma(n + \alpha + 1) \Gamma(n + \beta + 1) }{ n! \, (2n + \alpha + \beta + 1) \, \Gamma(n + \alpha + \beta + 1) } \delta_{mn}, \]

where $\Gamma$ denotes the gamma function

\[ \Gamma(t) := \int_0^\infty s^{t-1} e^{-s} \, \mathrm{d}s = (t - 1)! \quad \text{if } t \in \mathbb{N}. \]

(c) Other notable special cases of the Jacobi polynomials include the Chebyshev
polynomials of the first kind $T_n$, which are the special case
$\alpha = \beta = -\frac{1}{2}$, and the Chebyshev polynomials of the second kind $U_n$, which
are the special case $\alpha = \beta = \frac{1}{2}$. The Chebyshev polynomials are intimately
connected with trigonometric functions: for example,

\[ T_n(x) = \cos( n \arccos(x) ) \quad \text{for } |x| \leq 1, \]

and the $n$ roots of $T_n$ are $z_j := \cos\bigl( \frac{(2j - 1)\pi}{2n} \bigr)$ for $j = 1, \dots, n$.

(d) The (associated) Laguerre polynomials $\mathrm{La}_n^{(\alpha)}$, defined for $\alpha > -1$ and indexed
by $n \in \mathbb{N}_0$, are orthogonal polynomials for the gamma distribution
$x^\alpha e^{-x} \, \mathrm{d}x$ on the positive real half-line:

\[ \int_0^\infty \mathrm{La}_m^{(\alpha)}(x) \mathrm{La}_n^{(\alpha)}(x) x^\alpha e^{-x} \, \mathrm{d}x = \frac{ \Gamma(1 + \alpha + n) }{ n! } \delta_{mn} . \]

The polynomials $\mathrm{La}_n := \mathrm{La}_n^{(0)}$ are known simply as the Laguerre polynomials.

(e) The Hermite polynomials $\mathrm{He}_n$, indexed by $n \in \mathbb{N}_0$, are orthogonal polynomials
for standard Gaussian measure $\gamma := (2\pi)^{-1/2} e^{-x^2/2} \, \mathrm{d}x$ on $\mathbb{R}$:

\[ \int_{-\infty}^{\infty} \mathrm{He}_m(x) \mathrm{He}_n(x) \frac{ \exp(-x^2/2) }{ \sqrt{2\pi} } \, \mathrm{d}x = n! \, \delta_{mn} . \]

Together, the Jacobi, Laguerre and Hermite polynomials are known as
the classical orthogonal polynomials. They encompass the essential features
of orthogonal polynomials on the real line, according to whether the (absolutely
continuous) measure $\mu$ that generates them is supported on a bounded
interval, a semi-infinite interval, or the whole real line. (The theory of orthogonal
polynomials generated by discrete measures is similar, but has some
additional complications.) The first few Legendre, Hermite and Chebyshev
polynomials are given in Table 8.1 and illustrated in Figure 8.1. See Tables 8.2
and 8.3 at the end of the chapter for a summary of some other classical
systems of orthogonal polynomials corresponding to various probability
measures on subsets of the real line. See also Figure 8.4 for an illustration of
the Askey scheme, which classifies the various limit relations among families
of orthogonal polynomials.

n    Le_n(x)                            He_n(x)                 T_n(x)
-------------------------------------------------------------------------------------
0    $1$                                $1$                     $1$
1    $x$                                $x$                     $x$
2    $\frac{1}{2}(3x^2 - 1)$            $x^2 - 1$               $2x^2 - 1$
3    $\frac{1}{2}(5x^3 - 3x)$           $x^3 - 3x$              $4x^3 - 3x$
4    $\frac{1}{8}(35x^4 - 30x^2 + 3)$   $x^4 - 6x^2 + 3$        $8x^4 - 8x^2 + 1$
5    $\frac{1}{8}(63x^5 - 70x^3 + 15x)$ $x^5 - 10x^3 + 15x$     $16x^5 - 20x^3 + 5x$

Table 8.1: The first few Legendre polynomials $\mathrm{Le}_n$, which are orthogonal polynomials
for uniform measure $\mathrm{d}x$ on $[-1, 1]$; Hermite polynomials $\mathrm{He}_n$, which
are orthogonal polynomials for standard Gaussian measure $(2\pi)^{-1/2} e^{-x^2/2} \, \mathrm{d}x$
on $\mathbb{R}$; and Chebyshev polynomials of the first kind $T_n$, which are orthogonal
polynomials for the measure $(1 - x^2)^{-1/2} \, \mathrm{d}x$ on $[-1, 1]$.

Remark 8.3. Many sources, typically physicists' texts, use the weight function
$e^{-x^2} \, \mathrm{d}x$ instead of probabilists' preferred $(2\pi)^{-1/2} e^{-x^2/2} \, \mathrm{d}x$ or $e^{-x^2/2} \, \mathrm{d}x$
for the Hermite polynomials. Changing from one normalization to the other
is not difficult, but special care must be exercised in practice to see which
normalization a source is using, especially when relying on third-party software
packages.¹ To convert integrals with respect to one Gaussian measure
to integrals with respect to another (and hence get the right answers for
Gauss-Hermite quadrature), use the following change-of-variables formula:

\[ \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} f(x) e^{-x^2/2} \, \mathrm{d}x = \frac{1}{\sqrt{\pi}} \int_{\mathbb{R}} f( \sqrt{2} x ) e^{-x^2} \, \mathrm{d}x . \]

It follows from this that conversion between the physicists' and probabilists'
Gauss-Hermite quadrature formulae (see Chapter 9) is achieved by

\[ w_i^{\mathrm{prob}} = \frac{ w_i^{\mathrm{phys}} }{ \sqrt{\pi} }, \qquad x_i^{\mathrm{prob}} = \sqrt{2} \, x_i^{\mathrm{phys}} . \]


Existence of Orthogonal Polynomials. One thing that should be immediately
obvious is that if the measure $\mu$ is supported on only $N \in \mathbb{N}$ points,
then $\dim L^2(\mathbb{R}, \mu) = N$, and so $\mu$ admits only $N$ orthogonal polynomials.


¹ For example, the GAUSSQ Gaussian quadrature package from http://netlib.org/ uses
the physicists' $e^{-x^2} \, \mathrm{d}x$ normalization. The numpy.polynomial package for Python provides
separate interfaces to the physicists' and probabilists' Hermite polynomials, quadrature
rules, etc. as numpy.polynomial.hermite and numpy.polynomial.hermite_e respectively.
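
The change of normalization discussed in Remark 8.3 can be checked numerically with the two numpy interfaces named in the footnote. The sketch below is illustrative only: the two helper function names are hypothetical, while `hermgauss` and `hermegauss` are the Gauss-Hermite rules for the weights $e^{-x^2}$ and $e^{-x^2/2}$ respectively. Both helpers approximate the expectation of $f(X)$ for a standard Gaussian random variable $X$.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss      # physicists' weight e^{-x^2}
from numpy.polynomial.hermite_e import hermegauss   # probabilists' weight e^{-x^2/2}


def gaussian_mean_physicists(f, n):
    """E[f(X)], X ~ N(0, 1), via the physicists' n-point rule."""
    x, w = hermgauss(n)
    # Rescale the nodes by sqrt(2) and normalize the weights by sqrt(pi).
    return np.sum(w * f(np.sqrt(2.0) * x)) / np.sqrt(np.pi)


def gaussian_mean_probabilists(f, n):
    """E[f(X)], X ~ N(0, 1), via the probabilists' n-point rule."""
    x, w = hermegauss(n)
    # The weights of this rule sum to sqrt(2 pi), so normalize by that.
    return np.sum(w * f(x)) / np.sqrt(2.0 * np.pi)


if __name__ == "__main__":
    fourth_power = lambda x: x**4   # the fourth moment of N(0, 1) is 3
    print(gaussian_mean_physicists(fourth_power, 5))
    print(gaussian_mean_probabilists(fourth_power, 5))
```

Forgetting either the $\sqrt{2}$ rescaling of the nodes or the normalization of the weights is exactly the kind of error that the remark warns about.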
(a) Legendre polynomials, $\mathrm{Le}_n$, on $[-1, 1]$.

(b) Hermite polynomials, $\mathrm{He}_n$, on $\mathbb{R}$.

(c) Chebyshev polynomials of the first kind, $T_n$, on $[-1, 1]$.

Fig. 8.1: The Legendre, Hermite and Chebyshev polynomials of degrees 0
(black, dotted), 1 (black, dashed), 2 (black, solid), 3 (grey, dotted), 4 (grey,
dashed) and 5 (grey, solid).
This observation invites the question: what conditions on $\mu$ are necessary in
order to ensure the existence of a desired number of orthogonal polynomials
for $\mu$? Recall that a matrix $A$ is called a Hankel matrix if it has constant anti-diagonals,
i.e. if $a_{ij}$ depends only upon $i + j$. The definiteness of $L^2(\mu)$ inner
products, and hence the existence of orthogonal polynomials, is intimately
connected to determinants of Hankel matrices of moments of the measure $\mu$:

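Lemma 8.4. The $L^2(\mu)$ inner product is positive definite on $\mathfrak{P}_{\leq d}$ if and only
if the Hankel determinant $\det(H_n)$ is strictly positive for $n = 1, \dots, d + 1$,
where

\[ H_n := \begin{bmatrix} m_0 & m_1 & \cdots & m_{n-1} \\ m_1 & m_2 & \cdots & m_n \\ \vdots & \vdots & \ddots & \vdots \\ m_{n-1} & m_n & \cdots & m_{2n-2} \end{bmatrix}, \qquad m_n := \int_{\mathbb{R}} x^n \, \mathrm{d}\mu(x). \tag{8.1} \]

Hence, the $L^2(\mu)$ inner product is positive definite on $\mathfrak{P}$ if and only if, for
all $n \in \mathbb{N}$, $0 < \det(H_n) < \infty$.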


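Proof. Let $p(x) := c_d x^d + \dots + c_1 x + c_0 \in \mathfrak{P}_{\leq d}$ be arbitrary. Note that

\[ \| p \|_{L^2(\mu)}^2 = \int_{\mathbb{R}} \sum_{k, \ell = 0}^{d} c_k c_\ell x^{k + \ell} \, \mathrm{d}\mu(x) = \sum_{k, \ell = 0}^{d} c_k c_\ell m_{k + \ell}, \]

and so $\| p \|_{L^2(\mu)} \in (0, \infty)$ for every non-zero $p \in \mathfrak{P}_{\leq d}$ if and only if $H_{d+1}$ is a positive-definite matrix. By
Sylvester's criterion, $H_{d+1}$ is positive definite if and only if $\det(H_n) \in
(0, \infty)$ for $n = 1, 2, \dots, d + 1$, which completes the proof. □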
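
Lemma 8.4 is easy to explore numerically. The sketch below builds the Hankel matrices (8.1) for two measures chosen here purely for illustration: Lebesgue measure on $[-1, 1]$, whose moments are $m_k = 2/(k+1)$ for even $k$ and $0$ for odd $k$, and a measure consisting of unit point masses at $-1$, $0$ and $1$. All function names are hypothetical.

```python
import numpy as np


def hankel_moment_matrix(moments, n):
    """The n-by-n Hankel matrix H_n with entries m_{i+j}, i, j = 0, ..., n-1."""
    return np.array([[moments[i + j] for j in range(n)] for i in range(n)])


def lebesgue_moments(count):
    """Moments m_0, ..., m_{count-1} of Lebesgue measure on [-1, 1]."""
    return [2.0 / (k + 1) if k % 2 == 0 else 0.0 for k in range(count)]


def three_point_moments(count):
    """Moments of the measure with unit point masses at -1, 0 and 1."""
    points = np.array([-1.0, 0.0, 1.0])
    return [float(np.sum(points**k)) for k in range(count)]


def hankel_determinants(moments, n_max):
    """det(H_1), ..., det(H_{n_max}); needs moments up to order 2 n_max - 2."""
    return [float(np.linalg.det(hankel_moment_matrix(moments, n)))
            for n in range(1, n_max + 1)]


if __name__ == "__main__":
    print(hankel_determinants(lebesgue_moments(7), 4))     # all strictly positive
    print(hankel_determinants(three_point_moments(7), 4))  # the fourth one vanishes
```

For the three-point measure the determinants of $H_1$, $H_2$ and $H_3$ are positive but $\det(H_4) = 0$, in line with the earlier observation that a measure supported on $N$ points admits only $N$ orthogonal polynomials.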

Theorem 8.5. If the $L^2(\mu)$ inner product is positive definite on $\mathfrak{P}$, then
there exists an infinite sequence of orthogonal polynomials for $\mu$.

Proof. Apply the Gram-Schmidt procedure to the monomials $x^n$, $n \in \mathbb{N}_0$.
That is, take $q_0(x) = 1$, and for $n \in \mathbb{N}$ recursively define

\[ q_n(x) := x^n - \sum_{k=0}^{n-1} \frac{ \langle x^n, q_k \rangle }{ \langle q_k, q_k \rangle } q_k(x). \]

Since the inner product is positive definite, $\langle q_k, q_k \rangle > 0$, and so each $q_n$ is
uniquely defined. By construction, each $q_n$ is orthogonal to $q_k$ for $k < n$. □
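
The Gram-Schmidt construction in this proof can be carried out explicitly for a simple measure. The following sketch uses Lebesgue measure on $[-1, 1]$, for which the inner product of two polynomials can be computed exactly by integrating their product; the function names are hypothetical, and the approach is meant only to illustrate the construction for small degrees, since the procedure is known to be numerically unstable.

```python
import numpy as np
from numpy.polynomial import Polynomial


def inner_uniform(p, q):
    """L^2 inner product of two polynomials for Lebesgue measure on [-1, 1]."""
    antiderivative = (p * q).integ()
    return float(antiderivative(1.0) - antiderivative(-1.0))


def gram_schmidt_monic(n_max, inner=inner_uniform):
    """Monic orthogonal polynomials q_0, ..., q_{n_max} via Gram-Schmidt."""
    qs = []
    for n in range(n_max + 1):
        monomial = Polynomial([0.0] * n + [1.0])   # the monomial x^n
        q = monomial
        for q_k in qs:
            # Subtract the projection of x^n onto the earlier polynomial q_k.
            q = q - (inner(monomial, q_k) / inner(q_k, q_k)) * q_k
        qs.append(q)
    return qs


if __name__ == "__main__":
    for q in gram_schmidt_monic(3):
        print(q.coef)   # coefficients in increasing order of degree
```

The output polynomials are the monic multiples of the Legendre polynomials; for instance $q_2(x) = x^2 - \frac{1}{3}$, which is $\frac{2}{3} \mathrm{Le}_2(x)$.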

By Exercise 8.1, the hypothesis of Theorem 8.5 is satisfied if the measure $\mu$
has infinite support and all polynomials are $\mu$-integrable. For example, there
are infinitely many Legendre polynomials because polynomials are bounded
on $[-1, 1]$, and hence integrable with respect to uniform (Lebesgue) measure;
polynomials are unbounded on $\mathbb{R}$, but are integrable with respect to Gaussian
measure by Fernique's theorem (Theorem 2.47), so there are infinitely many
Hermite polynomials. In the other direction, there is the following converse
result:

Theorem 8.6. If the $L^2(\mu)$ inner product is positive definite on $\mathfrak{P}_{\leq d}$, but
not on $\mathfrak{P}_{\leq n}$ for any $n > d$, then $\mu$ admits only $d + 1$ orthogonal polynomials.

Proof. The Gram-Schmidt procedure can be applied so long as the denominators
$\langle q_k, q_k \rangle$ are strictly positive and finite, i.e. for $k < d + 1$. The polynomial
$q_{d+1}$ is orthogonal to $q_n$ for $n \leq d$; we now show that $q_{d+1} = 0$ as an element of $L^2(\mu)$. By
assumption, there exists a polynomial $p$ of degree $d + 1$, having the same
leading coefficient as $q_{d+1}$, such that $\| p \|_{L^2(\mu)}$ is $0$, $\infty$, or even undefined;
for simplicity, consider the case $\| p \|_{L^2(\mu)} = 0$, as the other cases are similar.
Hence, $p - q_{d+1}$ has degree at most $d$, so it can be written in the orthogonal basis
$\{ q_0, \dots, q_d \}$ as

\[ p - q_{d+1} = \sum_{k=0}^{d} c_k q_k \]

for some coefficients $c_0, \dots, c_d$. Hence, by orthogonality,

\[ 0 = \| p \|_{L^2(\mu)}^2 = \| q_{d+1} \|_{L^2(\mu)}^2 + \sum_{k=0}^{d} c_k^2 \| q_k \|_{L^2(\mu)}^2, \]

which implies, in particular, that $\| q_{d+1} \|_{L^2(\mu)} = 0$. Hence, the normalization
constant $\gamma_{d+1} = 0$, which is not permitted, and so $q_{d+1}$ is not a member of a
sequence of orthogonal polynomials for $\mu$. □


Theorem 8.7. If $\mu$ has finite moments only of degrees $0, 1, \dots, r$, then $\mu$
admits only a finite system of orthogonal polynomials $q_0, \dots, q_d$, where $d$ is
the minimum of $\lfloor r/2 \rfloor$ and $\# \operatorname{supp}(\mu) - 1$.

Proof. Exercise 8.2. □ 


Completeness of Orthogonal Polynomial Bases. A subtle point in the
theory of orthogonal polynomials is that although an infinite family $\mathcal{Q}$ of
orthogonal polynomials for $\mu$ forms an orthogonal set in $L^2(\mathbb{R}, \mu)$, it is not
always true that $\mathcal{Q}$ forms a complete orthogonal basis for $L^2(\mathbb{R}, \mu)$, i.e. it is
possible that the closed linear span of $\mathcal{Q}$ is a proper subspace:

\[ \overline{\operatorname{span}}\, \mathcal{Q} \subsetneq L^2(\mathbb{R}, \mu). \]

Examples of sufficient conditions for $\mathcal{Q}$ to form a complete orthogonal basis
for $L^2(\mathbb{R}, \mu)$ include finite exponential moments (i.e. $\mathbb{E}_{X \sim \mu}[\exp(a |X|)]$ is finite
for some $a > 0$), or the even stronger condition that the support of $\mu$ is a
bounded set. See Ernst et al. (2012) for a more detailed discussion, and see
Exercise 8.7 for the construction of an explicit example of an incomplete but
infinite set of orthogonal polynomials, namely those corresponding to the
probability distribution of a log-normal random variable.


8.2 Recurrence Relations

An aesthetically pleasing fact about orthogonal polynomials, and one that is
of vital importance in numerical methods, is that every system of orthogonal
polynomials satisfies a three-term recurrence relation of the form

\[ q_{n+1}(x) = (A_n x + B_n) q_n(x) - C_n q_{n-1}(x) \tag{8.2} \]

for some sequences $(A_n)$, $(B_n)$, $(C_n)$, with the initial terms $q_0(x) = 1$ and
$q_{-1}(x) = 0$. There are many variations in the way that this three-term
recurrence is presented: another one, which is particularly commonly used for
orthogonal polynomials arising from discrete measures, is

\[ x q_n(x) = A_n q_{n+1}(x) - (A_n + C_n) q_n(x) + C_n q_{n-1}(x), \tag{8.3} \]

and in Theorem 8.9 we give the three-term recurrence for the monic
orthogonal polynomials associated with a measure $\mu$.

Example 8.8. The Legendre, Hermite and Chebyshev polynomials satisfy
the recurrence relations

\[ \mathrm{Le}_{n+1}(x) = \frac{2n + 1}{n + 1} x \mathrm{Le}_n(x) - \frac{n}{n + 1} \mathrm{Le}_{n-1}(x), \]
\[ \mathrm{He}_{n+1}(x) = x \mathrm{He}_n(x) - n \mathrm{He}_{n-1}(x), \]
\[ T_{n+1}(x) = 2 x T_n(x) - T_{n-1}(x). \]

These relations can all be verified by direct substitution and an integration
by parts with respect to the appropriate generating measure $\mu$ on $\mathbb{R}$. The
Jacobi polynomials also satisfy the three-term recurrence (8.2) with

\[ A_n = \frac{(2n + 1 + \alpha + \beta)(2n + 2 + \alpha + \beta)}{2 (n + 1)(n + 1 + \alpha + \beta)}, \]
\[ B_n = \frac{(\alpha^2 - \beta^2)(2n + 1 + \alpha + \beta)}{2 (n + 1)(2n + \alpha + \beta)(n + 1 + \alpha + \beta)}, \tag{8.4} \]
\[ C_n = \frac{(n + \alpha)(n + \beta)(2n + 2 + \alpha + \beta)}{(n + 1)(n + 1 + \alpha + \beta)(2n + \alpha + \beta)}. \]
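
The recurrences of Example 8.8 translate directly into evaluation routines. The following is a minimal Python sketch, not taken from the text; the function names `legendre`, `hermite` and `chebyshev` are illustrative choices. Note that the Chebyshev recurrence is started from $T_0 = 1$ and $T_1 = x$, since the factor 2 only applies from the second step onwards.

```python
import math


def legendre(n, x):
    """Le_n(x) via Le_{k+1} = ((2k+1) x Le_k - k Le_{k-1}) / (k+1)."""
    prev, curr = 0.0, 1.0  # Le_{-1} = 0, Le_0 = 1
    for k in range(n):
        prev, curr = curr, ((2 * k + 1) * x * curr - k * prev) / (k + 1)
    return curr


def hermite(n, x):
    """He_n(x) via He_{k+1} = x He_k - k He_{k-1}."""
    prev, curr = 0.0, 1.0  # He_{-1} = 0, He_0 = 1
    for k in range(n):
        prev, curr = curr, x * curr - k * prev
    return curr


def chebyshev(n, x):
    """T_n(x) via T_{k+1} = 2x T_k - T_{k-1}, started from T_0 = 1, T_1 = x."""
    if n == 0:
        return 1.0
    prev, curr = 1.0, x
    for _ in range(n - 1):
        prev, curr = curr, 2 * x * curr - prev
    return curr
```

For instance, `legendre(2, x)` agrees with $(3x^2 - 1)/2$, `hermite(3, x)` with $x^3 - 3x$, and `chebyshev(n, math.cos(t))` with $\cos(nt)$, up to rounding.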


The coefficients for the three-term recurrence relation are determined
(up to multiplication by a constant for each degree) by the following theorem,
which gives the coefficients for the monic orthogonal polynomials associated
with a measure $\mu$:

Theorem 8.9. Let $\mathcal{Q} = \{ q_n \mid n \in \mathcal{N} \}$ be the monic orthogonal polynomials
for a measure $\mu$. Then

\[ q_{n+1}(x) = (x - \alpha_n) q_n(x) - \beta_n q_{n-1}(x), \quad q_0(x) = 1, \quad q_{-1}(x) = 0, \tag{8.5} \]

where

\[ \alpha_n := \frac{\langle x q_n, q_n \rangle_{L^2(\mu)}}{\langle q_n, q_n \rangle_{L^2(\mu)}} \quad \text{for } n \ge 0, \]
\[ \beta_n := \frac{\langle q_n, q_n \rangle_{L^2(\mu)}}{\langle q_{n-1}, q_{n-1} \rangle_{L^2(\mu)}} \quad \text{for } n \ge 1, \]
\[ \beta_0 := \langle q_0, q_0 \rangle_{L^2(\mu)} = \int_{\mathbb{R}} d\mu. \]

Hence, the orthonormal polynomials $\{ p_n \mid n \in \mathcal{N} \}$ for $\mu$ satisfy

\[ \sqrt{\beta_{n+1}} \, p_{n+1}(x) = (x - \alpha_n) p_n(x) - \sqrt{\beta_n} \, p_{n-1}(x), \tag{8.6} \]
\[ p_0(x) = \beta_0^{-1/2}, \quad p_{-1}(x) = 0. \]

Proof. First, note that the $L^2$ inner product² satisfies the shift property

\[ \langle x f, g \rangle_{L^2(\mu)} = \langle f, x g \rangle_{L^2(\mu)} \tag{8.7} \]

for all $f, g \colon \mathbb{R} \to \mathbb{R}$ for which either side exists.

Since $\deg(q_{n+1} - x q_n) \le n$, it follows that

\[ q_{n+1}(x) - x q_n(x) = -\alpha_n q_n(x) - \beta_n q_{n-1}(x) + \sum_{j=0}^{n-2} c_{nj} q_j(x) \tag{8.8} \]

for suitable scalars $\alpha_n$, $\beta_n$ and $c_{nj}$. Taking the inner product of both sides of
(8.8) with $q_n$ yields, by orthogonality,

\[ -\langle x q_n, q_n \rangle_{L^2(\mu)} = -\alpha_n \langle q_n, q_n \rangle_{L^2(\mu)}, \]

so that $\alpha_n = \langle x q_n, q_n \rangle_{L^2(\mu)} / \langle q_n, q_n \rangle_{L^2(\mu)}$ as claimed. The expression for $\beta_n$ is
obtained similarly, by taking the inner product of (8.8) with $q_{n-1}$ instead of
with $q_n$:

\[ \langle q_{n+1} - x q_n, q_{n-1} \rangle_{L^2(\mu)} = -\langle x q_n, q_{n-1} \rangle_{L^2(\mu)} = -\beta_n \langle q_{n-1}, q_{n-1} \rangle_{L^2(\mu)}, \]

and so

\[ \beta_n = \frac{\langle q_n, x q_{n-1} \rangle_{L^2(\mu)}}{\langle q_{n-1}, q_{n-1} \rangle_{L^2(\mu)}} \]
\[ = \frac{\langle q_n, q_n + r \rangle_{L^2(\mu)}}{\langle q_{n-1}, q_{n-1} \rangle_{L^2(\mu)}} \quad \text{with } \deg(r) < n \]
\[ = \frac{\langle q_n, q_n \rangle_{L^2(\mu)}}{\langle q_{n-1}, q_{n-1} \rangle_{L^2(\mu)}} \quad \text{since } q_n \perp \mathfrak{P}_{\le n-1}. \]

Finally, taking the inner product of (8.8) with $q_j$ for $j < n - 1$ yields

\[ -\langle x q_n, q_j \rangle_{L^2(\mu)} = c_{nj} \langle q_j, q_j \rangle_{L^2(\mu)}. \tag{8.9} \]

It follows from the shift property (8.7) that $\langle x q_n, q_j \rangle_{L^2(\mu)} = \langle q_n, x q_j \rangle_{L^2(\mu)}$.
Since $\deg(x q_j) \le n - 1$, it follows that the left-hand side of (8.9) vanishes, so
$c_{nj} = 0$, and the recurrence relation (8.5) is proved. □

² The Sobolev inner product, for example, does not satisfy the shift property (8.7). Hence,
the recurrence theory for Sobolev orthogonal polynomials is more complicated than the
$L^2$ case considered here.
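
For a discrete measure the inner products in Theorem 8.9 are finite sums, so the coefficients $\alpha_n$ and $\beta_n$ can be computed exactly alongside the polynomials themselves. The following Python sketch does this in rational arithmetic; the function name `stieltjes`, the choice of five equally weighted nodes, and the representation of each $q_n$ by its values at the nodes are illustrative assumptions, not part of the text. In line with Theorem 8.6, the degree must stay below the number of support points.

```python
from fractions import Fraction


def inner(f_vals, g_vals, weights):
    """<f, g> for the discrete measure sum_i w_i * delta_{x_i}."""
    return sum(w * f * g for w, f, g in zip(weights, f_vals, g_vals))


def stieltjes(nodes, weights, n_max):
    """Return (alpha, beta, Q), where Q[n] lists the values of the monic
    orthogonal polynomial q_n at the nodes, for n = 0, ..., n_max."""
    q_prev = [Fraction(0)] * len(nodes)  # q_{-1} = 0
    q_curr = [Fraction(1)] * len(nodes)  # q_0 = 1
    alpha, beta, Q = [], [], [q_curr]
    norm_prev = None
    for n in range(n_max + 1):
        norm = inner(q_curr, q_curr, weights)
        x_q = [x * v for x, v in zip(nodes, q_curr)]
        alpha.append(inner(x_q, q_curr, weights) / norm)
        beta.append(norm if n == 0 else norm / norm_prev)
        if n < n_max:
            # q_{n+1} = (x - alpha_n) q_n - beta_n q_{n-1}, as in (8.5)
            q_next = [(x - alpha[n]) * c - beta[n] * p
                      for x, c, p in zip(nodes, q_curr, q_prev)]
            Q.append(q_next)
            q_prev, q_curr = q_curr, q_next
        norm_prev = norm
    return alpha, beta, Q


# Uniform probability measure on the five points -2, -1, 0, 1, 2.
nodes = [Fraction(k) for k in range(-2, 3)]
weights = [Fraction(1, 5)] * 5
alpha, beta, Q = stieltjes(nodes, weights, 3)
```

By symmetry of this measure all $\alpha_n$ vanish, and the resulting polynomials are exactly orthogonal with respect to the discrete inner product.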

Furthermore, there is a converse theorem, which provides a characteriza- 
tion of precisely which three-term recurrences of the form (8.5) arise from 
systems of orthogonal polynomials: 

Theorem 8.10 (Favard, 1935). Let $(\alpha_n)_{n \in \mathcal{N}}$ and $(\beta_n)_{n \in \mathcal{N}}$ be real sequences
and let $\mathcal{Q} = \{ q_n \mid n \in \mathcal{N} \}$ be defined by the recurrence

\[ q_{n+1}(x) = (x - \alpha_n) q_n(x) - \beta_n q_{n-1}(x), \quad q_0(x) = 1, \quad q_{-1}(x) = 0. \]

Then $\mathcal{Q}$ is a system of monic orthogonal polynomials for some non-negative
measure $\mu$ if and only if, for all $n \in \mathcal{N}$, $\alpha_n \in \mathbb{R}$ and $\beta_n > 0$.

The proof of Favard’s theorem will be omitted here, but can be found in, 
e.g., Chihara (1978, Theorem 4.4). 

A useful consequence of the three-term recurrence is the following formula 
for sums of products of orthogonal polynomial values at any two points: 

Theorem 8.11 (Christoffel-Darboux formula). The orthonormal polynomials
$\{ p_n \mid n \in \mathcal{N} \}$ for a measure $\mu$ satisfy

\[ \sum_{k=0}^{n} p_k(y) p_k(x) = \sqrt{\beta_{n+1}} \, \frac{p_{n+1}(y) p_n(x) - p_n(y) p_{n+1}(x)}{y - x} \tag{8.10} \]

and

\[ \sum_{k=0}^{n} |p_k(x)|^2 = \sqrt{\beta_{n+1}} \bigl( p_{n+1}'(x) p_n(x) - p_n'(x) p_{n+1}(x) \bigr). \tag{8.11} \]

Proof. Multiply the recurrence relation (8.6), i.e.

\[ \sqrt{\beta_{k+1}} \, p_{k+1}(x) = (x - \alpha_k) p_k(x) - \sqrt{\beta_k} \, p_{k-1}(x), \]

by $p_k(y)$ on both sides and subtract the corresponding expression with $x$ and
$y$ interchanged to obtain

\[ (y - x) p_k(y) p_k(x) = \sqrt{\beta_{k+1}} \bigl( p_{k+1}(y) p_k(x) - p_k(y) p_{k+1}(x) \bigr) - \sqrt{\beta_k} \bigl( p_k(y) p_{k-1}(x) - p_{k-1}(y) p_k(x) \bigr). \]

Sum both sides from $k = 0$ to $k = n$ and use the telescoping nature of the sum
on the right to obtain (8.10). Take the limit as $y \to x$ to obtain (8.11). □

Corollary 8.12. The orthonormal polynomials $\{ p_n \mid n \in \mathcal{N} \}$ for a measure
$\mu$ satisfy

\[ p_{n+1}'(x) p_n(x) - p_n'(x) p_{n+1}(x) > 0. \]

Proof. Since $\beta_n > 0$ for all $n$, (8.11) implies that

\[ p_{n+1}'(x) p_n(x) - p_n'(x) p_{n+1}(x) \ge 0, \]

with equality if and only if the sum on the left-hand side of (8.11) vanishes.
However, since

\[ \sum_{k=0}^{n} |p_k(x)|^2 \ge |p_0(x)|^2 = \beta_0^{-1} > 0, \]

the claim follows. □
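
The Christoffel-Darboux formula (8.10) is easy to check numerically. The sketch below does so for the orthonormal Legendre polynomials on $[-1, 1]$ with respect to $dx$; it assumes the standard recurrence coefficients $\alpha_n = 0$, $\beta_0 = 2$ and $\beta_n = n^2 / (4n^2 - 1)$ for the monic Legendre polynomials, which are not stated in the text, and the function names are illustrative.

```python
import math


def legendre_beta(k):
    """Recurrence coefficient beta_k of the monic Legendre polynomials."""
    return 2.0 if k == 0 else k * k / (4.0 * k * k - 1.0)


def orthonormal_legendre(n_max, x):
    """Values p_0(x), ..., p_{n_max}(x), computed with recurrence (8.6)."""
    p = [1.0 / math.sqrt(legendre_beta(0))]
    prev = 0.0  # p_{-1} = 0
    for k in range(n_max):
        nxt = (x * p[k] - math.sqrt(legendre_beta(k)) * prev) \
            / math.sqrt(legendre_beta(k + 1))
        prev = p[k]
        p.append(nxt)
    return p


def christoffel_darboux_sides(n, x, y):
    """Left- and right-hand sides of (8.10) for x != y."""
    px = orthonormal_legendre(n + 1, x)
    py = orthonormal_legendre(n + 1, y)
    lhs = sum(py[k] * px[k] for k in range(n + 1))
    rhs = math.sqrt(legendre_beta(n + 1)) \
        * (py[n + 1] * px[n] - py[n] * px[n + 1]) / (y - x)
    return lhs, rhs
```

Calling `christoffel_darboux_sides(6, 0.3, -0.7)` should return two values that agree up to rounding error.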


8.3 Differential Equations 

In addition to their orthogonality and recurrence properties, the classical
orthogonal polynomials are eigenfunctions for second-order differential
operators. In particular, these operators take the form

\[ \mathcal{L} = Q(x) \frac{d^2}{dx^2} + L(x) \frac{d}{dx}, \]

where $Q \in \mathfrak{P}_{\le 2}$ is quadratic, $L \in \mathfrak{P}_{\le 1}$ is linear, and the degree-$n$ orthogonal
polynomial $q_n$ satisfies

\[ (\mathcal{L} q_n)(x) = Q(x) q_n''(x) + L(x) q_n'(x) = \lambda_n q_n(x), \tag{8.12} \]

where the eigenvalue is

\[ \lambda_n = n \Bigl( \frac{n - 1}{2} Q'' + L' \Bigr). \tag{8.13} \]

Note that it makes sense for $Q$ to be quadratic and $L$ to be linear, since then
(8.12) is an equality of two degree-$n$ polynomials.

Example 8.13. (a) The Jacobi polynomials satisfy $\mathcal{L} P_n^{(\alpha, \beta)} = \lambda_n P_n^{(\alpha, \beta)}$,
where

\[ \mathcal{L} := (1 - x^2) \frac{d^2}{dx^2} + \bigl( \beta - \alpha - (\alpha + \beta + 2) x \bigr) \frac{d}{dx}, \]
\[ \lambda_n := -n (n + \alpha + \beta + 1). \]

(b) The Hermite polynomials satisfy $\mathcal{L} \mathrm{He}_n = \lambda_n \mathrm{He}_n$, where

\[ \mathcal{L} := \frac{d^2}{dx^2} - x \frac{d}{dx}, \]
\[ \lambda_n := -n. \]

(c) The Laguerre polynomials satisfy $\mathcal{L} \mathrm{La}_n^{(\alpha)} = \lambda_n \mathrm{La}_n^{(\alpha)}$, where

\[ \mathcal{L} := x \frac{d^2}{dx^2} + (1 + \alpha - x) \frac{d}{dx}, \]
\[ \lambda_n := -n. \]

It is not difficult to verify that if $\mathcal{Q} = \{ q_n \mid n \in \mathcal{N} \}$ is a system of monic
orthogonal polynomials, which therefore satisfy the three-term recurrence

\[ q_{n+1}(x) = (x - \alpha_n) q_n(x) - \beta_n q_{n-1}(x) \]

from Theorem 8.9, then $q_n$ is an eigenfunction for $\mathcal{L}$ with eigenvalue $\lambda_n$ as in
(8.12)-(8.13): simply apply the three-term recurrence to the claimed equation
$\mathcal{L} q_{n+1} = \lambda_{n+1} q_{n+1}$ and examine the highest-degree terms. What is more
difficult to show is the converse result (which uses results from Sturm-Liouville
theory and is beyond the scope of this text) that, subject to suitable
conditions on $Q$ and $L$, the only eigenfunctions of $\mathcal{L}$ are polynomials of the
correct degrees, with the claimed eigenvalues, orthogonal under the measure
$d\mu = w(x) \, dx$, where

\[ w(x) \propto \frac{1}{Q(x)} \exp \Bigl( \int \frac{L(x)}{Q(x)} \, dx \Bigr). \]

Furthermore, the degree-$n$ orthogonal polynomial $q_n$ is given by Rodrigues'
formula

\[ q_n(x) \propto \frac{1}{w(x)} \frac{d^n}{dx^n} \bigl( w(x) Q(x)^n \bigr). \]

(Naturally, $w$ and the resulting polynomials are only unique up to choices
of normalization.) For our purposes, the main importance of the differential
properties of orthogonal polynomials is that, as a consequence, the
convergence rate of orthogonal polynomial approximations to a given function $f$
is improved when $f$ has a high degree of differentiability; see Theorem 8.23
later in this chapter.
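
The eigenfunction property (8.12) can be checked exactly on polynomial coefficients. The following Python sketch does this for the Hermite case $Q = 1$, $L(x) = -x$, for which (8.13) gives $\lambda_n = -n$; the coefficient-list representation (lowest degree first) and all function names are illustrative assumptions.

```python
def hermite_coeffs(n):
    """Integer coefficients of He_n, lowest degree first, built from the
    recurrence He_{k+1} = x He_k - k He_{k-1}."""
    prev, curr = [0], [1]  # He_{-1} = 0, He_0 = 1
    for k in range(n):
        shifted = [0] + curr  # coefficients of x * He_k
        padded = prev + [0] * (len(shifted) - len(prev))
        prev, curr = curr, [s - k * p for s, p in zip(shifted, padded)]
    return curr


def derivative(c):
    """Coefficients of the derivative of the polynomial with coefficients c."""
    return [j * c[j] for j in range(1, len(c))] or [0]


def apply_hermite_operator(c):
    """Coefficients of q'' - x q', i.e. the operator with Q = 1 and L = -x."""
    d1 = derivative(c)
    d2 = derivative(d1)
    x_d1 = [0] + d1  # coefficients of x * q'
    size = max(len(c), len(x_d1), len(d2))
    pad = lambda v: v + [0] * (size - len(v))
    return [a - b for a, b in zip(pad(d2), pad(x_d1))]


def is_hermite_eigenfunction(n):
    """Check that (q'' - x q') = -n q for q = He_n, coefficient by coefficient."""
    c = hermite_coeffs(n)
    image = apply_hermite_operator(c)
    expected = [-n * a for a in c] + [0] * (len(image) - len(c))
    return image == expected
```

Because the arithmetic is carried out on integers, the check `is_hermite_eigenfunction(n)` involves no rounding error.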


8.4 Roots of Orthogonal Polynomials

The points $x$ at which an orthogonal polynomial $q_n(x) = 0$ are its roots, or
zeros, and enjoy a number of useful properties. They play a fundamental role
in the method of approximate integration known as Gaussian quadrature,
which will be treated in Section 9.2.

The roots of an orthogonal polynomial can be found as the eigenvalues of 
a suitable matrix: 

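Definition 8.14. The Jacobi matrix of a measure $\mu$ is the infinite, symmetric,
tridiagonal matrix

\[ J_\infty(\mu) := \begin{pmatrix} \alpha_0 & \sqrt{\beta_1} & 0 & \cdots \\ \sqrt{\beta_1} & \alpha_1 & \sqrt{\beta_2} & \ddots \\ 0 & \sqrt{\beta_2} & \alpha_2 & \ddots \\ \vdots & \ddots & \ddots & \ddots \end{pmatrix}, \]

where $\alpha_k$ and $\beta_k$ are as in Theorem 8.9. The upper-left $n \times n$ minor of $J_\infty(\mu)$
is denoted $J_n(\mu)$.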

Theorem 8.15. Let $p_0, p_1, \dots$ be the orthonormal polynomials for $\mu$. The
zeros of $p_n$ are all real, are the eigenvalues of $J_n(\mu)$, and the eigenvector of
$J_n(\mu)$ corresponding to the zero of $p_n$ at $z$ is

\[ \mathbf{p}(z) := \begin{pmatrix} p_0(z) \\ \vdots \\ p_{n-1}(z) \end{pmatrix}. \]

Proof. Let $\mathbf{p}(x) := [p_0(x), \dots, p_{n-1}(x)]^{\mathsf{T}}$ as above. Then the first $n$
recurrence relations for the orthonormal polynomials, as given in Theorem 8.9,
can be summarized as

\[ x \mathbf{p}(x) = J_n(\mu) \mathbf{p}(x) + \sqrt{\beta_n} \, p_n(x) [0, \dots, 0, 1]^{\mathsf{T}}. \tag{8.14} \]

Now let $x = z$ be any zero of $p_n$. Note that $\mathbf{p}(z) \neq [0, \dots, 0]^{\mathsf{T}}$, since $\mathbf{p}(z)$ has
$1 / \sqrt{\beta_0}$ as its first component $p_0(z)$. Hence, (8.14) immediately implies that
$\mathbf{p}(z)$ is an eigenvector of $J_n(\mu)$ with eigenvalue $z$. Finally, since $J_n(\mu)$ is a
symmetric matrix, its eigenvalues (the zeros of $p_n$) are all real. □
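
The eigenvector relation behind Theorem 8.15 can be verified directly in a small case. The sketch below builds $J_3(\mu)$ for Lebesgue measure on $[-1, 1]$ and checks that $J_3 \mathbf{p}(z) = z \mathbf{p}(z)$ at the root $z = \sqrt{3/5}$ of the cubic Legendre polynomial. It assumes the standard coefficients $\alpha_n = 0$, $\beta_0 = 2$, $\beta_n = n^2 / (4n^2 - 1)$, which are not given in the text, and the helper names are illustrative.

```python
import math


def legendre_beta(k):
    """Recurrence coefficient beta_k of the monic Legendre polynomials."""
    return 2.0 if k == 0 else k * k / (4.0 * k * k - 1.0)


def legendre_jacobi_matrix(n):
    """Upper-left n-by-n block J_n of the Jacobi matrix (all alpha_k = 0)."""
    J = [[0.0] * n for _ in range(n)]
    for k in range(1, n):
        J[k - 1][k] = J[k][k - 1] = math.sqrt(legendre_beta(k))
    return J


def orthonormal_legendre(n_max, x):
    """Values p_0(x), ..., p_{n_max}(x), computed with recurrence (8.6)."""
    p = [1.0 / math.sqrt(legendre_beta(0))]
    prev = 0.0
    for k in range(n_max):
        nxt = (x * p[k] - math.sqrt(legendre_beta(k)) * prev) \
            / math.sqrt(legendre_beta(k + 1))
        prev = p[k]
        p.append(nxt)
    return p


def eigen_residual(n, z):
    """Largest entry of |J_n p(z) - z p(z)|, with p(z) = [p_0(z), ..., p_{n-1}(z)]."""
    J = legendre_jacobi_matrix(n)
    p = orthonormal_legendre(n - 1, z)
    return max(abs(sum(J[i][j] * p[j] for j in range(n)) - z * p[i])
               for i in range(n))
```

At a zero of $p_n$ the residual vanishes up to rounding; at any other point the last component picks up the term $\sqrt{\beta_n} \, p_n(x)$ from (8.14).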

All that can be said about the roots of an arbitrary polynomial $p$ of
degree $n$ is that, by the Fundamental Theorem of Algebra, $p$ has $n$ roots in $\mathbb{C}$
when counted with multiplicity. Since the zeros of orthogonal polynomials are
eigenvalues of a symmetric matrix (the Jacobi matrix), these zeros must be
real. In fact, though, orthogonal polynomials are guaranteed to have simple
real roots, and the roots of successive orthogonal polynomials alternate with
one another:

Theorem 8.16 (Zeros of orthogonal polynomials). Let $\mu$ be supported in a
non-degenerate interval $I \subseteq \mathbb{R}$, and let $\mathcal{Q} = \{ q_n \mid n \in \mathcal{N} \}$ be a system of
orthogonal polynomials for $\mu$.

(a) For each $n \in \mathcal{N}$, $q_n$ has exactly $n$ distinct real roots $z_1^{(n)}, \dots, z_n^{(n)} \in I$.

(b) If $(a, b)$ is an open interval of $\mu$-measure zero, then $(a, b)$ contains at most
one root of any orthogonal polynomial $q_n$ for $\mu$.

(c) The zeros $z_i^{(n)}$ of $q_n$ and $z_i^{(n+1)}$ of $q_{n+1}$ alternate:

\[ z_1^{(n+1)} < z_1^{(n)} < z_2^{(n+1)} < \dots < z_n^{(n+1)} < z_n^{(n)} < z_{n+1}^{(n+1)}; \]

hence, whenever $m > n$, between any two zeros of $q_n$ there is a zero of $q_m$.

Proof. (a) First observe that, for $n \ge 1$, $\langle q_n, 1 \rangle_{L^2(\mu)} = 0$, and so $q_n$ changes sign in $I$.
Since $q_n$ is continuous, the intermediate value theorem implies that $q_n$
has at least one real root $z_1^{(n)} \in I$. For $n > 1$, there must be another root
$z_2^{(n)} \in I$ of $q_n$ distinct from $z_1^{(n)}$, since if $q_n$ were to vanish only at $z_1^{(n)}$,
then $(x - z_1^{(n)}) q_n$ would not change sign in $I$, which would contradict the
orthogonality relation $\langle x - z_1^{(n)}, q_n \rangle_{L^2(\mu)} = 0$. Similarly, if $n > 2$, consider
$(x - z_1^{(n)})(x - z_2^{(n)}) q_n$ to deduce the existence of yet a third distinct root
$z_3^{(n)} \in I$. This procedure terminates when all the $n$ complex roots of $q_n$
are shown to lie in $I$.

(b) Suppose that $(a, b)$ contains two distinct zeros $z_i^{(n)}$ and $z_j^{(n)}$ of $q_n$. Let
$a_n \neq 0$ denote the coefficient of $x^n$ in $q_n(x)$. Then

\[ \frac{1}{a_n} \Bigl\langle q_n, \prod_{k \neq i, j} (x - z_k^{(n)}) \Bigr\rangle_{L^2(\mu)} = \int_{\mathbb{R}} (x - z_i^{(n)})(x - z_j^{(n)}) \prod_{k \neq i, j} (x - z_k^{(n)})^2 \, d\mu(x) > 0, \]

since the integrand is positive outside of $(a, b)$. However, this contradicts
the orthogonality of $q_n$ to all polynomials of degree less than $n$.

(c) As usual, let $p_n$ be the normalized version of $q_n$. Let $\sigma$, $\tau$ be consecutive
zeros of $p_n$, so that $p_n'(\sigma) p_n'(\tau) < 0$. Then Corollary 8.12 implies that
$p_{n+1}$ has opposite signs at $\sigma$ and $\tau$, and so the IVT implies that $p_{n+1}$
has at least one zero between $\sigma$ and $\tau$. This observation accounts for
$n - 1$ of the $n + 1$ zeros of $p_{n+1}$, namely $z_2^{(n+1)} < \dots < z_n^{(n+1)}$. There
are two further zeros of $p_{n+1}$, one to the left of $z_1^{(n)}$ and one to the right
of $z_n^{(n)}$. This follows because $p_n'(z_n^{(n)}) > 0$, so Corollary 8.12 implies that
$p_{n+1}(z_n^{(n)}) < 0$. Since $p_{n+1}(x) \to +\infty$ as $x \to +\infty$, the IVT implies the
existence of $z_{n+1}^{(n+1)} > z_n^{(n)}$. A similar argument establishes the existence
of $z_1^{(n+1)} < z_1^{(n)}$. □


8.5 Polynomial Interpolation 


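The existence of a unique polynomial $p(x) = \sum_{i=0}^{n} c_i x^i$ of degree at most $n$
that interpolates the values $y_0, \dots, y_n \in \mathbb{R}$ at $n + 1$ distinct points $x_0, \dots,
x_n \in \mathbb{R}$ follows from the invertibility of the Vandermonde matrix

\[ V_n(x_0, \dots, x_n) := \begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{pmatrix} \in \mathbb{R}^{(n+1) \times (n+1)} \tag{8.15} \]

and hence the unique solvability of the system of simultaneous linear equations

\[ V_n(x_0, \dots, x_n) \begin{pmatrix} c_0 \\ \vdots \\ c_n \end{pmatrix} = \begin{pmatrix} y_0 \\ \vdots \\ y_n \end{pmatrix}. \tag{8.16} \]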


In practice, a polynomial interpolant would never be constructed in this way
since, for nearly coincident nodes, the Vandermonde matrix is notoriously
ill-conditioned: the determinant is given by

\[ \det(V_n) = \prod_{0 \le i < j \le n} (x_j - x_i) \]

and, while the condition number of the Vandermonde matrix is hard to
calculate exactly, there are dishearteningly large lower bounds such as

\[ \kappa_{n, \infty} := \| V_n \|_{\infty \to \infty} \| V_n^{-1} \|_{\infty \to \infty} > 2^{n/2} \tag{8.17} \]

for sets of nodes that are symmetric about the origin, where

\[ \| V_n \|_{\infty \to \infty} := \sup \bigl\{ \| V_n x \|_\infty \bigm| x \in \mathbb{R}^{n+1}, \| x \|_\infty = 1 \bigr\} \]

denotes the matrix (operator) norm on $\mathbb{R}^{(n+1) \times (n+1)}$ induced by the $\infty$-norm
on $\mathbb{R}^{n+1}$ (Gautschi and Inglese, 1988).
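
To get a feel for the size of $\kappa_{n, \infty}$, the condition number can be computed exactly in rational arithmetic for small $n$. The sketch below is an illustration only: the function names and the choice of equally spaced nodes in $[-1, 1]$ (which are symmetric about the origin) are assumptions, and the exact Gauss-Jordan inverse is used purely to avoid rounding error in the demonstration, not as a recommended numerical method.

```python
from fractions import Fraction


def vandermonde(nodes):
    """Matrix with rows [1, x, x^2, ..., x^n], one row per node x."""
    return [[x ** j for j in range(len(nodes))] for x in nodes]


def inverse(A):
    """Exact Gauss-Jordan inverse of a square matrix of Fractions.
    Raises StopIteration if A is singular (e.g. repeated nodes)."""
    n = len(A)
    M = [list(row) + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def norm_inf(A):
    """Operator norm induced by the infinity-norm: the largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)


def cond_inf(nodes):
    """Exact infinity-norm condition number of the Vandermonde matrix."""
    V = vandermonde(nodes)
    return norm_inf(V) * norm_inf(inverse(V))


def equispaced(n):
    """The n + 1 equally spaced nodes 2k/n - 1 in [-1, 1], as exact rationals."""
    return [Fraction(2 * k, n) - 1 for k in range(n + 1)]
```

Evaluating `float(cond_inf(equispaced(n)))` for a few small values of $n$ shows the condition number far exceeding the lower bound in (8.17).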

However, there is another, and better-conditioned, way to express
the polynomial interpolation problem, the so-called Lagrange form, which
amounts to a clever choice of basis for $\mathfrak{P}_{\le n}$ (instead of the usual monomial
basis $\{ 1, x, x^2, \dots, x^n \}$) so that the matrix in (8.16) in the new basis is the
identity matrix.

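Definition 8.17. Let $x_0, \dots, x_n \in \mathbb{R}$ be distinct. The associated nodal
polynomial is defined to be

\[ \prod_{j=0}^{n} (x - x_j) \in \mathfrak{P}_{\le n+1}. \]

For $0 \le j \le n$, the associated Lagrange basis polynomial $\ell_j \in \mathfrak{P}_{\le n}$ is
defined by

\[ \ell_j(x) := \prod_{\substack{0 \le k \le n \\ k \neq j}} \frac{x - x_k}{x_j - x_k}. \]

Given also arbitrary values $y_0, \dots, y_n \in \mathbb{R}$, the associated Lagrange
interpolation polynomial is

\[ L(x) := \sum_{j=0}^{n} y_j \ell_j(x). \]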

Theorem 8.18. Given distinct $x_0, \dots, x_n \in \mathbb{R}$ and any $y_0, \dots, y_n \in \mathbb{R}$, the
associated Lagrange interpolation polynomial $L$ is the unique polynomial of
degree at most $n$ such that $L(x_k) = y_k$ for $k = 0, \dots, n$.

Proof. Observe that each Lagrange basis polynomial $\ell_j \in \mathfrak{P}_{\le n}$, and so
$L \in \mathfrak{P}_{\le n}$. Observe also that $\ell_j(x_k) = \delta_{jk}$. Hence,

\[ L(x_k) = \sum_{j=0}^{n} y_j \ell_j(x_k) = \sum_{j=0}^{n} y_j \delta_{jk} = y_k. \]

For uniqueness, consider the basis $\{ \ell_0, \dots, \ell_n \}$ of $\mathfrak{P}_{\le n}$ and suppose that $p =
\sum_{j=0}^{n} c_j \ell_j$ is any polynomial that interpolates the values $\{ y_k \}_{k=0}^{n}$ at the points
$\{ x_k \}_{k=0}^{n}$. But then, for each $k = 0, \dots, n$,

\[ y_k = \sum_{j=0}^{n} c_j \ell_j(x_k) = \sum_{j=0}^{n} c_j \delta_{jk} = c_k, \]

and so $p = L$, as claimed. □
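
Definition 8.17 and Theorem 8.18 can be turned into a few lines of code. The following Python sketch evaluates the Lagrange form directly from the product formula; the function names are illustrative, and the direct product formula is chosen for transparency rather than efficiency. With `Fraction` inputs the arithmetic is exact.

```python
from fractions import Fraction


def lagrange_basis(nodes, j, x):
    """Value at x of the j-th Lagrange basis polynomial for the given nodes."""
    value = 1
    for k, xk in enumerate(nodes):
        if k != j:
            value = value * (x - xk) / (nodes[j] - xk)
    return value


def lagrange_interpolant(nodes, values):
    """Return the function x -> L(x) = sum_j y_j * l_j(x)."""
    def L(x):
        return sum(y * lagrange_basis(nodes, j, x)
                   for j, y in enumerate(values))
    return L


# Interpolate the cubic x^3 - 2x + 1 at four distinct nodes.
nodes = [Fraction(0), Fraction(1), Fraction(2), Fraction(4)]
values = [x ** 3 - 2 * x + 1 for x in nodes]
L = lagrange_interpolant(nodes, values)
```

Since the interpolant of degree at most 3 is unique, `L` reproduces the cubic at every point, not just at the nodes.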

Runge's Phenomenon. Given the task of choosing nodes $x_k \in [a, b]$
between which to interpolate functions $f \colon [a, b] \to \mathbb{R}$, it might seem natural
to choose the nodes $x_k$ to be equally spaced in $[a, b]$. Runge (1901) famously
showed that this is not always a good choice of interpolation scheme. Consider
the function $f \colon [-1, 1] \to \mathbb{R}$ defined by

\[ f(x) := \frac{1}{1 + 25 x^2}, \tag{8.18} \]

and let $L_n$ be the degree-$n$ (Lagrange) interpolation polynomial for $f$ on
the equally spaced nodes $x_k := \frac{2k}{n} - 1$, $k = 0, \dots, n$. As illustrated in Figure 8.2(a), $L_n$
oscillates wildly near the endpoints of the interval $[-1, 1]$. Even worse, as $n$
increases, these oscillations do not die down but increase without bound: it
can be shown that

\[ \lim_{n \to \infty} \sup_{x \in [-1, 1]} | f(x) - L_n(x) | = \infty. \]

As a consequence, polynomial interpolation and numerical integration using
uniformly spaced nodes, as in the Newton-Cotes formula (Definition 9.5),
can in general be very inaccurate. The oscillations near $\pm 1$ can be controlled
by using a non-uniform set of nodes, in particular one that is denser near $\pm 1$
than near $0$; the standard example is the set of Chebyshev nodes defined by

\[ x_k := \cos \Bigl( \frac{2k - 1}{2n} \pi \Bigr) \quad \text{for } k = 1, \dots, n, \]

i.e. the roots of the Chebyshev polynomials of the first kind $T_n$, which are
orthogonal polynomials for the measure $(1 - x^2)^{-1/2} \, dx$ on $[-1, 1]$. As
illustrated in Figure 8.2(b), the Chebyshev interpolant of $f$ shows no Runge
oscillations. In fact, for every absolutely continuous function $f \colon [-1, 1] \to \mathbb{R}$,
the sequence of interpolating polynomials through the Chebyshev nodes
converges uniformly to $f$.
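
Runge's phenomenon is easy to reproduce. The sketch below compares the maximum interpolation error for the function (8.18) on equally spaced and on Chebyshev nodes; the maximum is only estimated on a finite grid, and the function names, the grid size and the use of the plain Lagrange product formula are illustrative assumptions.

```python
import math


def runge(x):
    """The function (8.18)."""
    return 1.0 / (1.0 + 25.0 * x * x)


def interpolate(nodes, values, x):
    """Evaluate the Lagrange interpolation polynomial at x."""
    total = 0.0
    for j, xj in enumerate(nodes):
        term = values[j]
        for k, xk in enumerate(nodes):
            if k != j:
                term *= (x - xk) / (xj - xk)
        total += term
    return total


def equispaced_nodes(n):
    """The n + 1 equally spaced nodes x_k = 2k/n - 1."""
    return [2.0 * k / n - 1.0 for k in range(n + 1)]


def chebyshev_nodes(m):
    """The m roots of T_m: x_k = cos((2k - 1) pi / (2m)), k = 1, ..., m."""
    return [math.cos((2 * k - 1) * math.pi / (2 * m)) for k in range(1, m + 1)]


def max_error(nodes, grid_size=2001):
    """Grid estimate of the sup-norm error of interpolating (8.18) at the nodes."""
    values = [runge(x) for x in nodes]
    grid = [2.0 * i / (grid_size - 1) - 1.0 for i in range(grid_size)]
    return max(abs(runge(x) - interpolate(nodes, values, x)) for x in grid)


# Degree-n interpolants use n + 1 nodes in both cases.
errors = {n: (max_error(equispaced_nodes(n)), max_error(chebyshev_nodes(n + 1)))
          for n in (6, 10, 14)}
```

In `errors`, the first entry of each pair (equally spaced nodes) grows with $n$, while the second (Chebyshev nodes) shrinks.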

However, Chebyshev nodes are not a panacea. Indeed, Faber (1914) showed
that, for every predefined sequence of sets of interpolation nodes, there is a
continuous function for which the interpolation process on those nodal sets
diverges. Conversely, for every continuous function there is a set of nodes on which the
interpolation process converges. In practice, in the absence of guarantees of
convergence, one should always perform 'sanity checks' to see if an
interpolation scheme has given rise to potentially spurious Runge-type oscillations.
One should also check whether or not the interpolant depends sensitively
upon the nodal set and data.

Fig. 8.2: Runge's phenomenon: the function $f(x) := (1 + 25 x^2)^{-1}$ is the
heavy grey curve, and also shown are the degree-$n$ polynomial interpolants
of $f$ through $n$ nodes, for $n = 6$ (dotted), 10 (dashed), and 14 (solid).
Panels: (a) uniformly spaced nodes; (b) Chebyshev nodes.

Norms of Interpolation Operators. The convergence and optimality of
interpolation schemes can be quantified using the norm of the corresponding
interpolation operator. From an abstract functional-analytic point of view,
interpolation is the result of applying a suitable projection operator $\Pi$ to a
function $f$ in some space $V$ to yield an interpolating function $\Pi f$ in some
prescribed subspace $U$ of $V$. For example, in the above discussion, given $n + 1$
distinct nodes $x_0, \dots, x_n$, the interpolation subspace $U$ is $\mathfrak{P}_{\le n}$ and the
operator $\Pi$ is

\[ \Pi \colon f \mapsto \sum_{i=0}^{n} f(x_i) \ell_i, \]

or, in terms of pointwise evaluation functionals (Dirac measures) $\delta_a$, $\Pi =
\sum_{i=0}^{n} \delta_{x_i} \ell_i$. Note, in particular, that $\Pi$ is a projection operator that acts as
the identity function on the interpolation subspace, i.e. the degree-$n$
polynomial interpolation of a polynomial $p \in \mathfrak{P}_{\le n}$ is just $p$ itself. The following
general lemma gives an upper bound on the error incurred by any
interpolation scheme that can be written as a projection operator:

Lemma 8.19 (Lebesgue's approximation lemma). Let $(V, \| \cdot \|)$ be a normed
space, $U \subseteq V$, and $\Pi \colon V \to U$ a linear projection onto $U$ (i.e. for all $u \in U$,
$\Pi u = u$) with finite operator norm $\| \Pi \|_{\mathrm{op}}$. Then, for all $v \in V$,

\[ \| v - \Pi v \| \le (1 + \| \Pi \|_{\mathrm{op}}) \inf_{u \in U} \| v - u \|. \tag{8.19} \]

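Proof. Let $\varepsilon > 0$ be arbitrary, and let $u^* \in U$ be $\varepsilon$-suboptimal for the
infimum on the right-hand side of (8.19), i.e.

\[ \| v - u^* \| \le \varepsilon + \inf_{u \in U} \| v - u \|. \tag{8.20} \]

Now

\[ \| v - \Pi v \| \le \| v - u^* \| + \| u^* - \Pi v \| \]
\[ = \| v - u^* \| + \| \Pi u^* - \Pi v \| \quad \text{since } \Pi|_U = \mathrm{id}_U \]
\[ \le \| v - u^* \| + \| \Pi \|_{\mathrm{op}} \| u^* - v \| \quad \text{by definition of } \| \Pi \|_{\mathrm{op}} \]
\[ = (1 + \| \Pi \|_{\mathrm{op}}) \| v - u^* \| \]
\[ \le (1 + \| \Pi \|_{\mathrm{op}}) \inf_{u \in U} \| v - u \| + \varepsilon (1 + \| \Pi \|_{\mathrm{op}}) \quad \text{by (8.20).} \]

Since $\varepsilon > 0$ was arbitrary, (8.19) follows. □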

Thus, with respect to a given norm $\| \cdot \|$, polynomial interpolation is
quasi-optimal up to a constant factor given by the operator norm of the
interpolation operator in that norm; in this context, $\| \Pi \|_{\mathrm{op}}$ is often called the Lebesgue
constant of the interpolation scheme. For the maximum norm, the Lebesgue
constant has a convenient expression in terms of the Lagrange basis
polynomials; see Exercise 8.10. The next section considers optimal approximation
with respect to $L^2$ norms, which amounts to orthogonal projection.
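
For the maximum norm on $[-1, 1]$, the Lebesgue constant of Lagrange interpolation is the maximum over $x$ of $\sum_j |\ell_j(x)|$, a standard fact that is the subject of Exercise 8.10. The sketch below estimates it on a finite grid, so the returned value is a lower bound for the true constant; the function names and grid size are illustrative assumptions.

```python
import math


def lebesgue_function(nodes, x):
    """Sum over j of |l_j(x)| for the Lagrange basis on the given nodes."""
    total = 0.0
    for j, xj in enumerate(nodes):
        term = 1.0
        for k, xk in enumerate(nodes):
            if k != j:
                term *= (x - xk) / (xj - xk)
        total += abs(term)
    return total


def lebesgue_constant(nodes, grid_size=4001):
    """Grid estimate (a lower bound) of the Lebesgue constant on [-1, 1]."""
    grid = [2.0 * i / (grid_size - 1) - 1.0 for i in range(grid_size)]
    return max(lebesgue_function(nodes, x) for x in grid)


n = 14
equispaced = [2.0 * k / n - 1.0 for k in range(n + 1)]
chebyshev = [math.cos((2 * k - 1) * math.pi / (2 * (n + 1)))
             for k in range(1, n + 2)]
lambda_equispaced = lebesgue_constant(equispaced)
lambda_chebyshev = lebesgue_constant(chebyshev)
```

Even at this modest degree, the constant for equally spaced nodes is in the hundreds while that for Chebyshev nodes stays below 3, which, via Lemma 8.19, is consistent with the behaviour seen in Runge's example.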


8.6 Polynomial Approximation 

The following theorem on the uniform approximation (on compact sets) of 
continuous functions by polynomials should hopefully be familiar: 

Theorem 8.20 (Weierstrass, 1885). Let $[a, b] \subset \mathbb{R}$ be a bounded interval, let
$f \colon [a, b] \to \mathbb{R}$ be continuous, and let $\varepsilon > 0$. Then there exists a polynomial $p$
such that

\[ \sup_{a \le x \le b} | f(x) - p(x) | < \varepsilon. \]

Remark 8.21. Note well that Theorem 8.20 only ensures uniform
approximation of continuous functions on compact sets. The reason is simple: since
any non-constant polynomial of finite degree tends to $\pm\infty$ at the extremes of the real line
$\mathbb{R}$, no polynomial can be uniformly close, over all of $\mathbb{R}$, to any non-constant
bounded function.

Theorem 8.20 concerns uniform approximation; for approximation in mean 
square, as a consequence of standard results on orthogonal projection in 
Hilbert spaces, we have the following: 

Theorem 8.22. Let $\mathcal{Q} = \{ q_n \mid n \in \mathcal{N} \}$ be a system of orthogonal polynomials
for a measure $\mu$ on a subinterval $I \subseteq \mathbb{R}$. For any $f \in L^2(I, \mu)$ and any $d \in \mathbb{N}_0$,
the orthogonal projection $\Pi_d f$ of $f$ onto $\mathfrak{P}_{\le d}$ is the best degree-$d$ polynomial
approximation of $f$ in mean square, i.e.

\[ \Pi_d f = \operatorname*{arg\,min}_{p \in \mathfrak{P}_{\le d}} \| p - f \|_{L^2(\mu)}, \]

where, denoting the orthogonal polynomials for $\mu$ by $\{ q_k \mid k \ge 0 \}$,

\[ \Pi_d f := \sum_{k=0}^{d} \frac{\langle f, q_k \rangle_{L^2(\mu)}}{\| q_k \|_{L^2(\mu)}^2} q_k, \]

and the residual is orthogonal to the projection subspace:

\[ \langle f - \Pi_d f, p \rangle_{L^2(\mu)} = 0 \quad \text{for all } p \in \mathfrak{P}_{\le d}. \]

An important property of polynomial expansions of functions is that the
quality of the approximation (i.e. the rate of convergence) improves as the
regularity of the function to be approximated increases. This property is
referred to as spectral convergence and is easily quantified by using the
machinery of Sobolev spaces. Recall that, given $k \in \mathbb{N}_0$ and a measure $\mu$ on a
subinterval $I \subseteq \mathbb{R}$, the Sobolev inner product and norm are defined by

\[ \langle u, v \rangle_{H^k(\mu)} := \sum_{m=0}^{k} \Bigl\langle \frac{d^m u}{dx^m}, \frac{d^m v}{dx^m} \Bigr\rangle_{L^2(\mu)} = \sum_{m=0}^{k} \int_I \frac{d^m u}{dx^m} \frac{d^m v}{dx^m} \, d\mu, \]
\[ \| u \|_{H^k(\mu)} := \langle u, u \rangle_{H^k(\mu)}^{1/2}. \]

The Sobolev space $H^k(I, \mu)$ consists of all $L^2$ functions that have weak
derivatives of all orders up to $k$ in $L^2$, and is equipped with the above inner
product and norm. (As usual, we abuse terminology and confuse functions
with their equivalence classes modulo equality $\mu$-almost everywhere.)

Legendre expansions of Sobolev functions on $[-1, 1]$ satisfy the following
spectral convergence theorem; the analogous result for Hermite expansions
of Sobolev functions on $\mathbb{R}$ is Exercise 8.13, and the general result is Exercise
8.14.

Theorem 8.23 (Spectral convergence of Legendre expansions). There is a
constant $C_k \ge 0$ that may depend upon $k$ but is independent of $d$ and $f$ such
that, for all $f \in H^k([-1, 1], dx)$,

\[ \| f - \Pi_d f \|_{L^2(dx)} \le C_k d^{-k} \| f \|_{H^k(dx)}. \tag{8.21} \]


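Proof. As a special case of the Jacobi polynomials (or by Exercise 8.11), the
Legendre polynomials satisfy $\mathcal{L} \mathrm{Le}_n = \lambda_n \mathrm{Le}_n$, where the differential operator
$\mathcal{L}$ and eigenvalues $\lambda_n$ are

\[ \mathcal{L} = \frac{d}{dx} \Bigl( (1 - x^2) \frac{d}{dx} \Bigr) = (1 - x^2) \frac{d^2}{dx^2} - 2x \frac{d}{dx}, \qquad \lambda_n = -n(n + 1). \]

If $f \in H^k([-1, 1], dx)$, then, by the definition of the Sobolev norm and the
operator $\mathcal{L}$, $\| \mathcal{L} f \|_{L^2} \le C \| f \|_{H^2}$ and, indeed, for any $m \in \mathbb{N}$ such that $2m \le k$,

\[ \| \mathcal{L}^m f \|_{L^2} \le C \| f \|_{H^{2m}}. \tag{8.22} \]

The key ingredient of the proof is integration by parts:

\[ \langle f, \mathrm{Le}_n \rangle_{L^2} = \lambda_n^{-1} \int_{-1}^{1} (\mathcal{L} \mathrm{Le}_n)(x) f(x) \, dx \]
\[ = \lambda_n^{-1} \int_{-1}^{1} \bigl( (1 - x^2) \mathrm{Le}_n''(x) f(x) - 2x \mathrm{Le}_n'(x) f(x) \bigr) \, dx \]
\[ = -\lambda_n^{-1} \int_{-1}^{1} \bigl( ((1 - x^2) f)'(x) \mathrm{Le}_n'(x) + 2x \mathrm{Le}_n'(x) f(x) \bigr) \, dx \quad \text{by IBP} \]
\[ = -\lambda_n^{-1} \int_{-1}^{1} (1 - x^2) f'(x) \mathrm{Le}_n'(x) \, dx \]
\[ = \lambda_n^{-1} \int_{-1}^{1} \bigl( (1 - x^2) f' \bigr)'(x) \mathrm{Le}_n(x) \, dx \quad \text{by IBP} \]
\[ = \lambda_n^{-1} \langle \mathcal{L} f, \mathrm{Le}_n \rangle_{L^2}. \]

Hence, for all $m \in \mathbb{N}_0$ for which $f$ has $2m$ weak derivatives,

\[ \langle f, \mathrm{Le}_n \rangle_{L^2} = \frac{\langle \mathcal{L}^m f, \mathrm{Le}_n \rangle_{L^2}}{\lambda_n^m}. \tag{8.23} \]

Hence,

\[ \| f - \Pi_d f \|_{L^2}^2 = \sum_{n=d+1}^{\infty} \frac{| \langle f, \mathrm{Le}_n \rangle_{L^2} |^2}{\| \mathrm{Le}_n \|_{L^2}^2} \]
\[ = \sum_{n=d+1}^{\infty} \frac{| \langle \mathcal{L}^m f, \mathrm{Le}_n \rangle_{L^2} |^2}{\lambda_n^{2m} \| \mathrm{Le}_n \|_{L^2}^2} \quad \text{by (8.23)} \]
\[ \le \frac{1}{\lambda_d^{2m}} \sum_{n=d+1}^{\infty} \frac{| \langle \mathcal{L}^m f, \mathrm{Le}_n \rangle_{L^2} |^2}{\| \mathrm{Le}_n \|_{L^2}^2} \]
\[ \le \frac{1}{\lambda_d^{2m}} \sum_{n=0}^{\infty} \frac{| \langle \mathcal{L}^m f, \mathrm{Le}_n \rangle_{L^2} |^2}{\| \mathrm{Le}_n \|_{L^2}^2} \]
\[ = \frac{1}{\lambda_d^{2m}} \| \mathcal{L}^m f \|_{L^2}^2 \quad \text{by Parseval (Theorem 3.24)} \]
\[ \le C^2 d^{-4m} \| f \|_{H^{2m}}^2 \quad \text{by (8.22),} \]

since $| \lambda_d | \ge d^2$. Setting $k = 2m$ and taking square roots yields (8.21). □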
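
The rapid decay promised by Theorem 8.23 for smooth functions can be observed numerically. The sketch below computes the Legendre projection $\Pi_d f$ of Theorem 8.22 for $f = \exp$ and estimates the $L^2$ error; it uses the standard normalization $\| \mathrm{Le}_k \|_{L^2}^2 = 2 / (2k + 1)$, and the composite Simpson rule, the number of subintervals and the function names are illustrative assumptions rather than anything prescribed by the text.

```python
import math


def legendre_values(d, x):
    """Values Le_0(x), ..., Le_d(x) from the three-term recurrence."""
    vals = [1.0]
    prev = 0.0  # Le_{-1} = 0
    for k in range(d):
        vals.append(((2 * k + 1) * x * vals[k] - k * prev) / (k + 1))
        prev = vals[k]
    return vals


def simpson(g, a, b, m=2000):
    """Composite Simpson rule with m (even) subintervals."""
    h = (b - a) / m
    total = g(a) + g(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3


def legendre_projection(f, d):
    """Return x -> (Pi_d f)(x) = sum_k <f, Le_k> / ||Le_k||^2 * Le_k(x)."""
    coeffs = []
    for k in range(d + 1):
        inner = simpson(lambda x: f(x) * legendre_values(k, x)[k], -1.0, 1.0)
        coeffs.append(inner * (2 * k + 1) / 2.0)  # ||Le_k||^2 = 2 / (2k + 1)
    return lambda x: sum(c * v for c, v in zip(coeffs, legendre_values(d, x)))


def l2_error(f, d):
    """Quadrature estimate of || f - Pi_d f || in L^2([-1, 1], dx)."""
    proj = legendre_projection(f, d)
    return math.sqrt(simpson(lambda x: (f(x) - proj(x)) ** 2, -1.0, 1.0))


errors = [l2_error(math.exp, d) for d in range(7)]
```

For this analytic $f$ the entries of `errors` fall faster than any fixed power of $d$, as one would expect since $f \in H^k$ for every $k$.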

Gibbs' Phenomenon. However, in the other direction, poor regularity can
completely ruin the nice convergence of spectral expansions. The classic
example of this is Gibbs' phenomenon, in which one tries to approximate
the sign function

\[ \operatorname{sgn}(x) := \begin{cases} -1, & \text{if } x < 0, \\ 0, & \text{if } x = 0, \\ 1, & \text{if } x > 0, \end{cases} \]

on $[-1, 1]$ by its expansion with respect to a system of orthogonal polynomials
such as the Legendre polynomials $\mathrm{Le}_n(x)$ or the Fourier polynomials $e^{inx}$. The
degree-$(2N + 1)$ Legendre expansion of the sign function is

\[ (\Pi_{2N+1} \operatorname{sgn})(x) = \sum_{n=0}^{N} \frac{(-1)^n (4n + 3) (2n)!}{2^{2n+1} (n + 1)! \, n!} \mathrm{Le}_{2n+1}(x). \tag{8.24} \]

See Figure 8.3 for an illustration. Although $\Pi_{2N+1} \operatorname{sgn} \to \operatorname{sgn}$ as $N \to \infty$
in the $L^2$ sense, there is no hope of uniform convergence: the oscillations
at the discontinuity at $0$, and indeed at the endpoints $\pm 1$, do not decay to
zero as $N \to \infty$. The inability of globally smooth basis functions such as
Legendre polynomials to accurately resolve discontinuities naturally leads to
the consideration of non-smooth basis functions such as wavelets.
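
The expansion (8.24) can be evaluated directly to see both behaviours at once: the $L^2$ error shrinks while the overshoot near the discontinuity persists. In the sketch below the squared $L^2$ error is obtained from Parseval's identity with $\| \operatorname{sgn} \|_{L^2}^2 = 2$ and $\| \mathrm{Le}_{2n+1} \|_{L^2}^2 = 2 / (4n + 3)$; the function names and the grid used to estimate the overshoot are illustrative assumptions.

```python
import math


def sign_coefficient(n):
    """Coefficient of Le_{2n+1} in the Legendre expansion (8.24) of sgn."""
    return ((-1) ** n * (4 * n + 3) * math.factorial(2 * n)
            / (2 ** (2 * n + 1) * math.factorial(n + 1) * math.factorial(n)))


def legendre_values(d, x):
    """Values Le_0(x), ..., Le_d(x) from the three-term recurrence."""
    vals = [1.0]
    prev = 0.0
    for k in range(d):
        vals.append(((2 * k + 1) * x * vals[k] - k * prev) / (k + 1))
        prev = vals[k]
    return vals


def sign_expansion(N, x):
    """The degree-(2N + 1) Legendre expansion of sgn, evaluated at x."""
    vals = legendre_values(2 * N + 1, x)
    return sum(sign_coefficient(n) * vals[2 * n + 1] for n in range(N + 1))


def l2_error_squared(N):
    """|| sgn - Pi_{2N+1} sgn ||^2 in L^2([-1, 1], dx), via Parseval."""
    return 2.0 - sum(sign_coefficient(n) ** 2 * 2.0 / (4 * n + 3)
                     for n in range(N + 1))


def overshoot(N, grid_size=2000):
    """Grid estimate of the maximum of the expansion over (0, 1]."""
    return max(sign_expansion(N, i / grid_size) for i in range(1, grid_size + 1))
```

Comparing `l2_error_squared(N)` and `overshoot(N)` for increasing `N` illustrates the statement above: the former decreases towards zero, whereas the latter stays noticeably above 1.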

Remark 8.21 Revisited. To repeat, even though smoothness of $f$ improves
the rate of convergence of the orthogonal polynomial expansion $\Pi_d f \to f$ as
$d \to \infty$ in the $L^2$ sense, the uniform convergence and pointwise predictive
value of an orthogonal polynomial expansion $\Pi_d f$ are almost certain to be
poor on unbounded (non-compact) domains, and no amount of smoothness
of $f$ can rectify this problem.


8.7 Multivariate Orthogonal Polynomials 


For working with polynomials in $d$ variables, we will use standard multi-index
notation. Multi-indices will be denoted by Greek letters $\alpha = (\alpha_1, \dots, \alpha_d) \in
\mathbb{N}_0^d$. For $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ and $\alpha \in \mathbb{N}_0^d$, the monomial $x^\alpha$ is defined by

\[ x^\alpha := x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d}, \]

and $|\alpha| := \alpha_1 + \dots + \alpha_d$ is called the total degree of $x^\alpha$. A polynomial is a
function $p \colon \mathbb{R}^d \to \mathbb{R}$ of the form

\[ p(x) := \sum_{\alpha} c_\alpha x^\alpha \]

for some coefficients $c_\alpha \in \mathbb{R}$. The total degree of $p$ is denoted $\deg(p)$ and is
the maximum of the total degrees of the non-trivial summands, i.e.

\[ \deg(p) := \max \{ |\alpha| \mid c_\alpha \neq 0 \}. \]

Fig. 8.3: Legendre expansions of the sign function on $[-1, 1]$ exhibit Gibbsian
oscillations at $0$ and at $\pm 1$. The sign function is shown as the heavy black curve;
also shown are the Legendre expansions (8.24) to degree $2N - 1$ for $N = 5$
(dotted), 15 (dashed), and 25 (solid).
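The space of all polynomials in $x_1, \dots, x_d$ is denoted $\mathfrak{P}^d$, while the subset
consisting of those $d$-variate polynomials of total degree at most $k$ is denoted
$\mathfrak{P}^d_{\le k}$. These spaces of multivariate polynomials can be written as (direct sums
of) tensor products of spaces of univariate polynomials:

\[ \mathfrak{P}^d = \mathfrak{P} \otimes \cdots \otimes \mathfrak{P}, \]
\[ \mathfrak{P}^d_{\le k} = \bigoplus_{|\alpha| \le k} \operatorname{span} \{ x_1^{\alpha_1} \} \otimes \cdots \otimes \operatorname{span} \{ x_d^{\alpha_d} \}. \]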


A polynomial that contains only terms of fixed total degree $k \in \mathbb{N}_0$, i.e.
one of the form

\[ p(x) = \sum_{|\alpha| = k} c_\alpha x^\alpha \]

for coefficients $c_\alpha \in \mathbb{R}$, is said to be homogeneous of degree $k$: $p$ satisfies
the homogeneity relation $p(\lambda x) = \lambda^k p(x)$ for every scalar $\lambda$. Homogeneous
polynomials are useful in both theory and practice because every polynomial
can be written as a sum of homogeneous ones, and the total degree provides
a grading of the space of polynomials:

\[ \mathfrak{P}^d = \bigoplus_{k \in \mathbb{N}_0} \bigl\{ p \in \mathfrak{P}^d \bigm| p \text{ is homogeneous of degree } k \bigr\}. \]


Given a measure $\mu$ on $\mathbb{R}^d$, it is tempting to apply the Gram-Schmidt
process with respect to the inner product

\[ \langle f, g \rangle_{L^2(\mu)} := \int_{\mathbb{R}^d} f(x) g(x) \, d\mu(x) \]

to the monomials $\{ x^\alpha \mid \alpha \in \mathbb{N}_0^d \}$ to obtain a system of orthogonal
polynomials for the measure $\mu$. However, there is an immediate problem, in that
orthogonal polynomials of several variables are not unique. In order to apply
the Gram-Schmidt process, we need to give a linear order to multi-indices
$\alpha \in \mathbb{N}_0^d$. Common choices of ordering for multi-indices $\alpha$ (here illustrated for
$d = 2$) include the lexicographic ordering

    α     (0,0)  (0,1)  (0,2)  (0,3)   ⋯   (1,1)  (1,2)   ⋯
    |α|     0      1      2      3     ⋯     2      3     ⋯

which has the disadvantage that it does not respect the total degree $|\alpha|$, and
the graded reverse-lexicographic ordering

    α     (0,0)  (0,1)  (1,0)  (0,2)  (1,1)  (2,0)  (0,3)   ⋯
    |α|     0      1      1      2      2      2      3     ⋯

which does respect total degree; the reversals of these orderings, in which
one orders first by $\alpha_1$ instead of $\alpha_d$, are also commonly used. In any case,
there is no natural ordering of $\mathbb{N}_0^d$, and different orders will give different
sequences of orthogonal polynomials. Instead of fixing such a total order, we
relax Definition 8.1 slightly:

Definition 8.24. Let $\mu$ be a non-negative measure on $\mathbb{R}^d$. A family of
polynomials $\mathcal{Q} = \{ q_\alpha \mid \alpha \in \mathbb{N}_0^d \}$ is called a weakly orthogonal system of polynomials
if $q_\alpha$ is such that

\[ \langle q_\alpha, p \rangle_{L^2(\mu)} = 0 \quad \text{for all } p \in \mathfrak{P}^d \text{ with } \deg(p) < |\alpha|. \]

The system $\mathcal{Q}$ is called a strongly orthogonal system of polynomials if

\[ \langle q_\alpha, q_\beta \rangle_{L^2(\mu)} = 0 \iff \alpha \neq \beta. \]

Hence, in the many-variables case, an orthogonal polynomial of total
degree $n$, while it is required to be orthogonal to all polynomials of strictly
lower total degree, may be non-orthogonal to other polynomials of the same
total degree $n$. However, the meaning of orthonormality is unchanged: a
system of polynomials $\{ p_\alpha \mid \alpha \in \mathbb{N}_0^d \}$ is orthonormal if

\[ \langle p_\alpha, p_\beta \rangle_{L^2(\mu)} = \delta_{\alpha \beta}. \]

While the computation of orthogonal polynomials of many variables is, in
general, a difficult task, it is substantially simpler if the measure is a product
measure: multivariate orthogonal polynomials can be obtained as products
of univariate orthogonal polynomials.

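Theorem 8.25. Suppose that $\mu = \bigotimes_{i=1}^{d} \mu_i$ is a product measure on $\mathbb{R}^d$ and
that, for each $i = 1, \dots, d$, $\mathcal{Q}^{(i)} = \{ q^{(i)}_{\alpha_i} \mid \alpha_i \in \mathcal{N}_i \}$ is a system of orthogonal
polynomials for the marginal measure $\mu_i$ on $\mathbb{R}$. Then

\[ \mathcal{Q} = \bigotimes_{i=1}^{d} \mathcal{Q}^{(i)} = \Bigl\{ q_\alpha := \prod_{i=1}^{d} q^{(i)}_{\alpha_i} \Bigm| \alpha \in \mathcal{N}_1 \times \dots \times \mathcal{N}_d \Bigr\} \]

is a strongly orthogonal system of polynomials for $\mu$ in which $\deg(q_\alpha) = |\alpha|$.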

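Proof. It is clear that $q_\alpha$, as defined above, has total degree $|\alpha|$. Let $q_\alpha$ and
$q_\beta$ be distinct polynomials in the proposed orthogonal system $Q$. Since $\alpha \neq \beta$,
it follows that $\alpha$ and $\beta$ differ in at least one component, so suppose without
loss of generality that $\alpha_1 \neq \beta_1$. By Fubini's theorem,

\[ \langle q_\alpha, q_\beta \rangle_{L^2(\mu)} = \int_{\mathbb{R}^d} q_\alpha q_\beta \, d\mu = \int_{\mathbb{R}^{d-1}} \prod_{j=2}^d q^{(j)}_{\alpha_j} q^{(j)}_{\beta_j} \left( \int_{\mathbb{R}} q^{(1)}_{\alpha_1} q^{(1)}_{\beta_1} \, d\mu_1 \right) d(\mu_2 \otimes \dots \otimes \mu_d). \]

But, since $Q^{(1)}$ is a system of orthogonal univariate polynomials for $\mu_1$, and
since $\alpha_1 \neq \beta_1$,

\[ \int_{\mathbb{R}} q^{(1)}_{\alpha_1}(x_1) \, q^{(1)}_{\beta_1}(x_1) \, d\mu_1(x_1) = 0. \]

Hence, $\langle q_\alpha, q_\beta \rangle_{L^2(\mu)} = 0$.

On the other hand, for each polynomial $q_\alpha \in Q$,

\[ \| q_\alpha \|_{L^2(\mu)}^2 = \bigl\| q^{(1)}_{\alpha_1} \bigr\|_{L^2(\mu_1)}^2 \bigl\| q^{(2)}_{\alpha_2} \bigr\|_{L^2(\mu_2)}^2 \cdots \bigl\| q^{(d)}_{\alpha_d} \bigr\|_{L^2(\mu_d)}^2, \]

which is strictly positive by the assumption that each $Q^{(i)}$ is a system of
orthogonal univariate polynomials for $\mu_i$.

Hence, $\langle q_\alpha, q_\beta \rangle_{L^2(\mu)} = 0$ if and only if $\alpha$ and $\beta$ are distinct, so $Q$ is a
system of strongly orthogonal polynomials for $\mu$. □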
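
The product construction of Theorem 8.25 can be checked numerically in a simple special case. The following is a minimal sketch, assuming Lebesgue measure on $[-1, 1]^2$ (a product measure whose marginals admit the Legendre polynomials as orthogonal polynomials); the function name `product_legendre_gram` and the choice of quadrature order are illustrative assumptions, not part of the text.

```python
import numpy as np
from numpy.polynomial import legendre as L


def product_legendre_gram(max_degree, n_quad=8):
    """Gram matrix of the products P_a(x) P_b(y) on [-1, 1]^2 (Lebesgue measure).

    The univariate inner products are computed with an n_quad-point
    Gauss-Legendre rule, which is exact here as long as
    2 * max_degree <= 2 * n_quad - 1.
    """
    x, w = L.leggauss(n_quad)
    alphas = [(a, b) for a in range(max_degree + 1) for b in range(max_degree + 1)]
    # Values of each univariate Legendre polynomial at the quadrature nodes.
    vals = [L.Legendre.basis(k)(x) for k in range(max_degree + 1)]
    gram = np.empty((len(alphas), len(alphas)))
    for r, (a1, a2) in enumerate(alphas):
        for c, (b1, b2) in enumerate(alphas):
            # Fubini: the two-dimensional inner product factorizes into
            # a product of one-dimensional inner products.
            gram[r, c] = np.sum(w * vals[a1] * vals[b1]) * np.sum(w * vals[a2] * vals[b2])
    return alphas, gram


alphas, gram = product_legendre_gram(2)
off_diagonal = gram - np.diag(np.diag(gram))
print(np.max(np.abs(off_diagonal)))   # numerically zero: strong orthogonality
print(np.min(np.diag(gram)))          # strictly positive squared norms
```

The off-diagonal entries vanish (up to rounding) and the diagonal entries are the products of the univariate squared norms, as in the proof.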
Gottlieb and Shu (1997). The UQ applications of wavelet bases are discussed
by Le Maître and Knio (2010, Chapter 8) and in articles of Le Maître et al.
(2004a, b, 2007).


8.9 Exercises 

Exercise 8.1. Prove that the $L^2(\mathbb{R}, \mu)$ inner product is positive definite on
the space $\mathfrak{P}$ of all polynomials if all polynomials are $\mu$-integrable and the
measure $\mu$ has infinite support.

Exercise 8.2. Prove Theorem 8.7. That is, show that if $\mu$ has finite moments
only of degrees $0, 1, \dots, r$, then $\mu$ admits only a finite system of orthogonal
polynomials $q_0, \dots, q_d$, where $d = \min\{ \lfloor r/2 \rfloor, \#\operatorname{supp}(\mu) - 1 \}$.

Exercise 8.3. Define a Borel measure, the Cauchy-Lorentz distribution, $\mu$
on $\mathbb{R}$ by

\[ \frac{d\mu}{dx}(x) = \frac{1}{\pi} \frac{1}{1 + x^2}. \]

Show that $\mu$ is a probability measure, that $\dim L^2(\mathbb{R}, \mu; \mathbb{R}) = \infty$, find all
orthogonal polynomials for $\mu$, and explain your results.

Exercise 8.4. Following the example of the Cauchy-Lorentz distribution,
given $\ell \in [0, \infty)$, construct an explicit example of a probability measure
$\mu \in \mathcal{M}_1(\mathbb{R})$ with moments of orders up to $\ell$ but no higher.

Exercise 8.5. Calculate orthogonal polynomials for the generalized Maxwell
distribution $d\mu(x) = x^\alpha e^{-x^2} \, dx$ on the half-line $[0, \infty)$, where $\alpha > -1$ is a
constant. The case $\alpha = 2$ is known as the Maxwell distribution and the case
$\alpha = 0$ as the half-range Hermite distribution.

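Exercise 8.6. The coefficients of any system of orthogonal polynomials are
determined, up to multiplication by an arbitrary constant for each degree, by
the Hankel determinants of the polynomial moments. Show that, if $m_n$ and
$H_n$ are as in (8.1), then the degree-$n$ orthogonal polynomial $q_n$ for $\mu$ is

\[ q_n(x) = c_n \det \begin{bmatrix} H_n & \begin{matrix} m_n \\ \vdots \\ m_{2n-1} \end{matrix} \\ \begin{matrix} 1 & \cdots & x^{n-1} \end{matrix} & x^n \end{bmatrix}, \]

i.e.

\[ q_n(x) = c_n \det \begin{bmatrix} m_0 & m_1 & m_2 & \cdots & m_n \\ m_1 & m_2 & m_3 & \cdots & m_{n+1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ m_{n-1} & m_n & m_{n+1} & \cdots & m_{2n-1} \\ 1 & x & x^2 & \cdots & x^n \end{bmatrix}, \]

where, for each $n$, $c_n \neq 0$ is an arbitrary choice of normalization (e.g. $c_n = 1$
for monic orthogonal polynomials).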

Exercise 8.7. Let $\mu$ be the probability distribution of $Y := e^X$, where
$X \sim \mathcal{N}(0, 1)$ is a standard normal random variable, i.e. let $\mu$ be the standard
log-normal distribution. The following exercise shows that the system $Q =
\{ q_k \mid k \in \mathbb{N}_0 \}$ of orthogonal polynomials for $\mu$ is not a complete orthogonal
basis for $L^2((0, \infty), \mu; \mathbb{R})$.

(a) Show that $\mu$ has the Lebesgue density function $\rho \colon \mathbb{R} \to \mathbb{R}$ given by

\[ \rho(y) := \mathbb{I}[y > 0] \, \frac{1}{y \sqrt{2\pi}} \exp\left( -\frac{(\log y)^2}{2} \right). \]

(b) Let $f \in L^1(\mathbb{R}, \mu; \mathbb{R})$ be odd and 1-periodic, i.e. $f(x) = -f(-x) = f(x + 1)$
for all $x \in \mathbb{R}$. Show that, for all $k \in \mathbb{N}_0$,

\[ \int_0^\infty y^k f(\log y) \, d\mu(y) = 0. \]

(c) Let $g := f \circ \log$ and suppose that $g \in L^2((0, \infty), \mu; \mathbb{R})$. Show that the
expansion of $g$ in the orthogonal polynomials $\{ q_k \mid k \in \mathbb{N}_0 \}$ has all
coefficients equal to zero, and thus that this expansion does not converge
to $g$ when $g \neq 0$.

Exercise 8.8. Complete the proof of Theorem 8.9 by deriving the formula
for $\beta_n$.

Exercise 8.9. Calculate the orthogonal polynomials of Table 8.2 by hand 
for degree at most 5, and write a numerical program to compute them for 
higher degree. 

Exercise 8.10. Let $x_0, \dots, x_n$ be distinct nodes in an interval $I$ and let
$\ell_i \in \mathfrak{P}_{\leq n}$ be the associated Lagrange basis polynomials. Define $\lambda \colon I \to \mathbb{R}$ by

\[ \lambda(x) := \sum_{i=0}^n |\ell_i(x)|. \]

Show that, with respect to the supremum norm $\| f \|_\infty := \sup_{x \in I} |f(x)|$, the
polynomial interpolation operator $\Pi$ from $C^0(I; \mathbb{R})$ to $\mathfrak{P}_{\leq n}$ has operator norm
given by

\[ \| \Pi \|_{\mathrm{op}} = \| \lambda \|_\infty, \]

i.e. the Lebesgue constant of the interpolation scheme is the supremum norm
of the sum of absolute values of the Lagrange basis polynomials.

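Exercise 8.11. Using the three-term recurrence relation $(n + 1) \mathrm{Le}_{n+1}(x) =
(2n + 1) x \, \mathrm{Le}_n(x) - n \, \mathrm{Le}_{n-1}(x)$, prove by induction that, for all $n \in \mathbb{N}_0$,

\[ \frac{d}{dx} \mathrm{Le}_n(x) = \frac{n}{x^2 - 1} \bigl( x \, \mathrm{Le}_n(x) - \mathrm{Le}_{n-1}(x) \bigr), \]

and

\[ \frac{d}{dx} \left( (1 - x^2) \frac{d}{dx} \mathrm{Le}_n(x) \right) = -n(n + 1) \, \mathrm{Le}_n(x). \]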


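Exercise 8.12. Let $\gamma = \mathcal{N}(0, 1)$ be standard Gaussian measure on $\mathbb{R}$.
Establish the integration-by-parts formula

\[ \int_{\mathbb{R}} f(x) g'(x) \, d\gamma(x) = -\int_{\mathbb{R}} \bigl( f'(x) - x f(x) \bigr) g(x) \, d\gamma(x). \]

Using the three-term recurrence relation $\mathrm{He}_{n+1}(x) = x \, \mathrm{He}_n(x) - n \, \mathrm{He}_{n-1}(x)$,
prove by induction that, for all $n \in \mathbb{N}_0$,

\[ \frac{d}{dx} \mathrm{He}_n(x) = n \, \mathrm{He}_{n-1}(x), \]

and

\[ \left( \frac{d^2}{dx^2} - x \frac{d}{dx} \right) \mathrm{He}_n(x) = -n \, \mathrm{He}_n(x). \]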


Exercise 8.13 (Spectral convergence of Hermite expansions). Let $\gamma =
\mathcal{N}(0, 1)$ be standard Gaussian measure on $\mathbb{R}$. Use Exercise 8.12 to mimic
the proof of Theorem 8.23 to show that there is a constant $C_k > 0$ that may
depend upon $k$ but is independent of $d$ and $f$ such that, for all $f \in H^k(\mathbb{R}, \gamma)$,
$f$ and its degree $d$ expansion in the Hermite orthogonal basis of $L^2(\mathbb{R}, \gamma)$
satisfy

\[ \| f - \Pi_d f \|_{L^2(\gamma)} \leq C_k d^{-k/2} \| f \|_{H^k(\gamma)}. \]

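Exercise 8.14 (Spectral convergence for classical orthogonal polynomial
expansions). Let $Q = \{ q_n \mid n \in \mathbb{N}_0 \}$ be orthogonal polynomials for an
absolutely continuous measure $d\mu = w(x) \, dx$ on $\mathbb{R}$, where the weight function $w$
is proportional to $\frac{1}{Q(x)} \exp\left( \int \frac{L(x)}{Q(x)} \, dx \right)$ with $L$ linear and $Q$ quadratic, which
are eigenfunctions for the differential operator $\mathcal{L} = Q(x) \frac{d^2}{dx^2} + L(x) \frac{d}{dx}$ with
eigenvalues $\lambda_n = n \left( \frac{n - 1}{2} Q'' + L' \right)$.

(a) Show that $\mu$ has an integration-by-parts formula of the following form:
for all smooth functions $f$ and $g$ with compact support in the interior of
$\operatorname{supp}(\mu)$,

\[ \int_{\mathbb{R}} f(x) g'(x) \, d\mu(x) = -\int_{\mathbb{R}} (Tf)(x) g(x) \, d\mu(x), \]

where

\[ (Tf)(x) = f'(x) + f(x) \frac{L(x) - Q'(x)}{Q(x)}. \]

(b) Hence show that, for smooth enough $f$, $\mathcal{L} f = T^2(Qf) - T(Lf)$.

(c) Hence show that, whenever $f$ has $2m$ derivatives,

\[ \langle f, q_n \rangle_{L^2(\mu)} = \frac{\langle \mathcal{L}^m f, q_n \rangle_{L^2(\mu)}}{\lambda_n^m}. \]

Show also that $\mathcal{L}$ is a symmetric and negative semi-definite operator
(i.e. $\langle \mathcal{L} f, g \rangle_{L^2(\mu)} = \langle f, \mathcal{L} g \rangle_{L^2(\mu)}$ and $\langle \mathcal{L} f, f \rangle_{L^2(\mu)} \leq 0$), so that $(-\mathcal{L})$ has
a square root $(-\mathcal{L})^{1/2}$, and $\mathcal{L}$ has a square root $\mathcal{L}^{1/2} = i (-\mathcal{L})^{1/2}$.

(d) Conclude that there is a constant $C_k > 0$ that may depend upon $k$ but is
independent of $d$ and $f$ such that $f \colon \mathbb{R} \to \mathbb{R}$ and its degree $d$ expansion
$\Pi_d f$ in the basis $Q$ of $L^2(\mathbb{R}, \mu)$ satisfy

\[ \| f - \Pi_d f \|_{L^2(\mu)} \leq C_k |\lambda_d|^{-k/2} \bigl\| \mathcal{L}^{k/2} f \bigr\|_{L^2(\mu)}. \]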


8.10 Tables of Classical Orthogonal Polynomials 


Tables 8.2 and 8.3 and Figure 8.4 on the next pages summarize the key
properties of the classical families of orthogonal polynomials associated with
continuous and discrete probability distributions on $\mathbb{R}$. More extensive
information of this kind can be found in Chapter 22 of Abramowitz and Stegun
(1992) and in Chapter 18 of the NIST Handbook (Olver et al., 2010), and
these tables are based upon those sources.

Table 8.2: Summary of some commonly-used orthogonal polynomials, their associated probability distributions
and key properties. Here, $(x)_n := x(x + 1) \cdots (x + n - 1)$ denotes the rising factorial or Pochhammer symbol.

Table 8.3: Continuation of Table 8.2.

Chapter 9

Numerical Integration


A turkey is fed for 1000 days — every 
day confirms to its statistical department 
that the human race cares about its welfare 
“with increased statistical significance”. On 
the 1001st day, the turkey has a surprise.


The Fourth Quadrant: A Map of the Limits 

of Statistics 
Nassim Taleb 


The topic of this chapter is the numerical evaluation of definite integrals. 


Many UQ methods have at their core simple probabilistic constructions such 
as expected values, and expectations are nothing more than Lebesgue
integrals. However, while it is mathematically enough to know that the Lebesgue
integral of some function exists, practical applications demand the evaluation
of such an integral or, rather, its approximate evaluation. This usually
means evaluating the integrand at some finite collection of sample points. It
is important to bear in mind, though, that sampling is not free (each sam- 
ple of the integration domain, or function evaluation, may correspond to a 
multi-million-dollar experiment) and that practical applications often involve 
many dependent and independent variables, i.e. high-dimensional domains 
of integration. Hence, the accurate numerical integration of integrands over 
high-dimensional spaces using few samples is something of a ‘Holy Grail’ in 
this area. 

The topic of integration has a long history, being along with differentiation
one of the twin pillars of calculus, and was historically also known as
quadrature. Nowadays, quadrature usually refers to a particular method of
numerical integration, namely a finite-sum approximation of the form

\[ \int_\Theta f(x) \, d\mu(x) \approx \sum_{i=1}^n w_i f(x_i), \]
(c) Springer International Publishing Switzerland 2015 

T.J. Sullivan, Introduction to Uncertainty Quantification , Texts 

in Applied Mathematics 63, DOI 10.1007/978-3-319-23395-6-9 
where the nodes $x_1, \dots, x_n \in \Theta$ and weights $w_1, \dots, w_n \in \mathbb{R}$ are chosen
depending only upon the measure space $(\Theta, \mathcal{F}, \mu)$, independently of the
integrand $f \colon \Theta \to \mathbb{R}$. This chapter will cover three principal forms of
quadrature that are distinguished by the manner in which the nodes are
generated: classical deterministic quadrature, in which the nodes are determined
in a deterministic fashion from the measure $\mu$; random sampling (Monte
Carlo) methods, in which the nodes are random samples from the measure
$\mu$; and pseudo-random (quasi-Monte Carlo) methods, in which the nodes
are in fact deterministic, but are in some sense 'approximately random' and
$\mu$-distributed. Along the way, there will be some remarks about how the
various methods scale to high-dimensional domains of integration.


9.1 Univariate Quadrature 

This section concerns the numerical integration of a real-valued function $f$
with respect to a measure $\mu$ on a sub-interval $I \subseteq \mathbb{R}$, doing so by sampling
the function at pre-determined points of $I$ and taking a suitable weighted
average. That is, the aim is to construct an approximation of the form

\[ \int_I f(x) \, d\mu(x) \approx Q(f) := \sum_{i=1}^n w_i f(x_i), \]

with prescribed nodes $x_1, \dots, x_n \in I$ and weights $w_1, \dots, w_n \in \mathbb{R}$. The
approximation $Q(f)$ is called a quadrature formula. The aim is to choose nodes
and weights wisely, so that the quality of the approximation $\int_I f \, d\mu \approx Q(f)$
is good for a large class of integrands $f$. One measure of the quality of the
approximation is the following:

Definition 9.1. A quadrature formula is said to have order of accuracy
$n \in \mathbb{N}_0$ if $\int_I p \, d\mu = Q(p)$ whenever $p \in \mathfrak{P}_{\leq n}$, i.e. if it exactly integrates every
polynomial of degree at most $n$.

A quadrature formula $Q(f) = \sum_{i=1}^n w_i f(x_i)$ can be identified with the
discrete measure $\sum_{i=1}^n w_i \delta_{x_i}$. If some of the weights $w_i$ are negative, then
this measure is a signed measure. This point of view will be particularly
useful when considering multi-dimensional quadrature formulae. Regardless
of the signature of the weights, the following limitation on the accuracy of
quadrature formulae is fundamental:

Lemma 9.2. Let $\mu$ be a non-negative measure on an interval $I \subseteq \mathbb{R}$. Then
no quadrature formula with $n$ distinct nodes in the interior of $\operatorname{supp}(\mu)$ can
have order of accuracy $2n$ or greater.

Proof. Let $x_1, \dots, x_n$ be any $n$ distinct points in the interior of the support of
$\mu$, and let $w_1, \dots, w_n \in \mathbb{R}$ be any weights. Let $f$ be the degree-$2n$ polynomial
$f(x) := \prod_{j=1}^n (x - x_j)^2$, i.e. the square of the nodal polynomial. Then

\[ \int_I f(x) \, d\mu(x) > 0 = \sum_{j=1}^n w_j f(x_j), \]

since $f$ vanishes at each node $x_j$. Hence, the quadrature formula is not exact
for polynomials of degree $2n$. □
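
Definition 9.1 and Lemma 9.2 can be explored numerically. The sketch below assumes Lebesgue measure on an interval $[a, b]$ and tests exactness on monomials only (which suffices, by linearity); the function name `order_of_accuracy`, the cut-off `max_degree` and the tolerance `tol` are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np


def order_of_accuracy(nodes, weights, a, b, max_degree=20, tol=1e-12):
    """Largest n such that the rule integrates x**k exactly for all k <= n.

    Assumes Lebesgue measure on [a, b]; returns -1 if even constants fail.
    """
    nodes = np.asarray(nodes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = -1
    for k in range(max_degree + 1):
        exact = (b ** (k + 1) - a ** (k + 1)) / (k + 1)   # integral of x**k over [a, b]
        if abs(np.sum(weights * nodes ** k) - exact) > tol:
            break
        order = k
    return order


# One node at the midpoint of [0, 1]: order 1, which is < 2n = 2.
print(order_of_accuracy([0.5], [1.0], 0.0, 1.0))
# Three nodes with weights 1/6, 4/6, 1/6: order 3, which is < 2n = 6.
print(order_of_accuracy([0.0, 0.5, 1.0], [1 / 6, 4 / 6, 1 / 6], 0.0, 1.0))
```

In both cases the computed order stays strictly below $2n$, as Lemma 9.2 requires.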

The first, simplest, quadrature formulae to consider are those in which
the nodes form an equally spaced discrete set of points in $[a, b]$. Many of
these quadrature formulae may be familiar from high-school mathematics.
Suppose in what follows that $\mu$ is Lebesgue measure on the interval $[a, b]$.

Definition 9.3 (Midpoint rule). The midpoint quadrature formula has the
single node $x_1 := \frac{a + b}{2}$ and the single weight $w_1 := |b - a|$. That is, it is the
approximation

\[ \int_a^b f(x) \, dx \approx Q_1(f) := f\left( \frac{a + b}{2} \right) |b - a|. \]

Another viewpoint on the midpoint rule is that it is the approximation
of the integrand $f$ by the constant function with value $f\left( \frac{a + b}{2} \right)$. The next
quadrature formula, on the other hand, amounts to the approximation of $f$
by the affine function

\[ x \mapsto f(a) + \frac{x - a}{b - a} \bigl( f(b) - f(a) \bigr) \]

that equals $f(a)$ at $a$ and $f(b)$ at $b$.

Definition 9.4 (Trapezoidal rule). The trapezoidal quadrature formula has
the nodes $x_1 := a$ and $x_2 := b$ and the weights $w_1 := \frac{|b - a|}{2}$ and $w_2 := \frac{|b - a|}{2}$.
That is, it is the approximation

\[ \int_a^b f(x) \, dx \approx Q_2(f) := \bigl( f(a) + f(b) \bigr) \frac{|b - a|}{2}. \]
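
Definitions 9.3 and 9.4 translate directly into code. The following minimal sketch assumes Lebesgue measure on $[a, b]$; the function names `midpoint_rule` and `trapezoidal_rule` are illustrative.

```python
def midpoint_rule(f, a, b):
    """Q_1(f): one node at the midpoint of [a, b], with weight |b - a|."""
    return f(0.5 * (a + b)) * abs(b - a)


def trapezoidal_rule(f, a, b):
    """Q_2(f): nodes a and b, each with weight |b - a| / 2."""
    return (f(a) + f(b)) * abs(b - a) / 2.0


# Both rules are exact for affine integrands ...
affine = lambda x: 3.0 * x + 1.0
print(midpoint_rule(affine, 0.0, 2.0), trapezoidal_rule(affine, 0.0, 2.0))

# ... but not for quadratics: the exact integral of x**2 over [0, 1] is 1/3.
square = lambda x: x * x
print(midpoint_rule(square, 0.0, 1.0), trapezoidal_rule(square, 0.0, 1.0))
```

For the quadratic integrand the midpoint rule underestimates and the trapezoidal rule overestimates the integral, as expected for a convex function.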

Recall the definitions of the Lagrange basis polynomials $\ell_j$ and the
Lagrange interpolation polynomial $L$ for a set of nodes and values from
Definition 8.17. The midpoint and trapezoidal quadrature formulae amount
to approximating $f$ by a Lagrange interpolation polynomial $L$ of degree 0
or 1 and hence approximating $\int_a^b f(x) \, dx$ by $\int_a^b L(x) \, dx$. The general such
construction for equidistant nodes is the following:

Definition 9.5 (Newton-Cotes formula). Consider $n + 1$ equally spaced
points

\[ a = x_0 < x_1 = x_0 + h < x_2 = x_0 + 2h < \dots < x_n = b, \]

where $h = \frac{b - a}{n}$. The closed Newton-Cotes quadrature formula is the quadrature
formula that arises from approximating $f$ by the Lagrange interpolating
polynomial $L$ that runs through the points $(x_j, f(x_j))_{j=0}^n$; the open Newton-Cotes
quadrature formula is the quadrature formula that arises from approximating
$f$ by the Lagrange interpolating polynomial $L$ that runs through the points
$(x_j, f(x_j))_{j=1}^{n-1}$.

In general, when a quadrature rule is formed based upon a polynomial
interpolation of the integrand, we have the following formula for the weights
in terms of the Lagrange basis polynomials:

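Proposition 9.6. Given an integrand $f \colon [a, b] \to \mathbb{R}$ and nodes $x_0, \dots, x_n$ in
$[a, b]$, let $L_f$ denote the (Lagrange form) degree-$n$ polynomial interpolant of
$f$ through $x_0, \dots, x_n$, and let $Q$ denote the quadrature rule

\[ \int_a^b f(x) \, dx \approx Q(f) := \int_a^b L_f(x) \, dx. \]

Then $Q$ is the quadrature rule $Q(f) = \sum_{j=0}^n w_j f(x_j)$ with weights

\[ w_j = \int_a^b \ell_j(x) \, dx. \]

Proof. Simply observe that

\[ \int_a^b L_f(x) \, dx = \int_a^b \sum_{j=0}^n f(x_j) \ell_j(x) \, dx = \sum_{j=0}^n f(x_j) \int_a^b \ell_j(x) \, dx. \]

□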
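
Proposition 9.6 gives a direct recipe for computing interpolatory quadrature weights. The sketch below is one possible implementation for Lebesgue measure on $[a, b]$, using NumPy's `Polynomial` class to build and integrate each Lagrange basis polynomial; the function name `interpolatory_weights` and the restriction to at least two nodes are assumptions made for simplicity.

```python
import numpy as np
from numpy.polynomial import Polynomial


def interpolatory_weights(nodes, a, b):
    """Weights w_j = integral over [a, b] of the Lagrange basis polynomial l_j."""
    nodes = np.asarray(nodes, dtype=float)
    if nodes.size < 2:
        raise ValueError("this sketch expects at least two distinct nodes")
    weights = []
    for j, xj in enumerate(nodes):
        others = np.delete(nodes, j)
        # l_j(x) = prod_{i != j} (x - x_i) / (x_j - x_i)
        ell = Polynomial.fromroots(others) / np.prod(xj - others)
        antiderivative = ell.integ()
        weights.append(antiderivative(b) - antiderivative(a))
    return np.array(weights)


# Closed Newton-Cotes rule on two points of [0, 1]: the trapezoidal weights.
print(interpolatory_weights([0.0, 1.0], 0.0, 1.0))
# Closed Newton-Cotes rule on three points of [0, 1]: Simpson's weights.
print(interpolatory_weights([0.0, 0.5, 1.0], 0.0, 1.0))
```

The same function applies to non-equidistant nodes, since Proposition 9.6 makes no assumption on their spacing.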


The midpoint rule is the open Newton-Cotes quadrature formula on three
points; the trapezoidal rule is the closed Newton-Cotes quadrature formula
on two points. Milne's rule is the open Newton-Cotes formula on five points;
Simpson's rule, Simpson's 3/8 rule, and Boole's rule are the closed
Newton-Cotes formulae on three, four, and five points respectively. The quality of
Newton-Cotes quadrature formulae can be very poor, essentially because
Runge's phenomenon can make the quality of the approximation $f \approx L$ very
poor.

Remark 9.7. In practice, quadrature over $[a, b]$ is often performed by taking
a partition (which may or may not be uniform)

\[ a = p_0 < p_1 < \dots < p_k = b \]

of the interval $[a, b]$, applying a primitive quadrature rule such as the ones
developed in this chapter to each subinterval $[p_{i-1}, p_i]$, and taking a weighted
sum of the results. Such quadrature rules are called compound quadrature
rules. For example, the elementary $n$-point 'Riemann sum' quadrature rule

\[ \int_a^b f(x) \, dx \approx \sum_{i=1}^n \frac{b - a}{n} f\left( a + \frac{(b - a)\bigl(i - \tfrac{1}{2}\bigr)}{n} \right) \tag{9.1} \]

is a compound application of the mid-point quadrature formula from
Definition 9.3. Note well that (9.1) is not the same as the $n$-point Newton-Cotes
rule.
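
A compound rule of the kind described in Remark 9.7 can be sketched as follows, assuming a uniform partition of $[a, b]$ into $n$ subintervals and the midpoint rule on each; the function name `compound_midpoint` is illustrative.

```python
def compound_midpoint(f, a, b, n):
    """Apply the midpoint rule on each of n equal subintervals of [a, b] and sum."""
    h = (b - a) / n
    return h * sum(f(a + (i - 0.5) * h) for i in range(1, n + 1))


# The error for the integrand x**2 on [0, 1] (exact value 1/3) shrinks as n grows.
for n in (1, 10, 100):
    print(n, abs(compound_midpoint(lambda x: x * x, 0.0, 1.0, n) - 1.0 / 3.0))
```

Each evaluation uses only degree-0 interpolation on a short subinterval, which is why such compound rules avoid the Runge-type instabilities of high-order Newton-Cotes formulae.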


9.2 Gaussian Quadrature 


Gaussian quadrature is a powerful method for numerical integration in which
both the nodes and the weights are chosen so as to maximize the order of
accuracy of the quadrature formula. Remarkably, by the correct choice of
$n$ nodes and weights, the quadrature formula can be made accurate for all
polynomials of degree at most $2n - 1$. Moreover, the weights in this quadrature
formula are all positive, and so the quadrature formula is stable even for high
$n$; see Exercise 9.1 for an illustration of the shortcomings of quadrature rules
with weights of both signs.

Recall that the objective of quadrature is to approximate a definite integral
$\int_a^b f(x) \, d\mu(x)$, where $\mu$ is a (non-negative) measure on $[a, b]$, by a finite sum
$Q_n(f) := \sum_{j=1}^n w_j f(x_j)$, where the nodes $x_1, \dots, x_n$ and weights $w_1, \dots, w_n$
will be chosen appropriately. For the method of Gaussian quadrature, let
$Q = \{ q_n \mid n \in \mathcal{N} \}$ be a system of orthogonal polynomials for $\mu$. That is, $q_n$
is a polynomial of degree exactly $n$ such that

\[ \int_a^b p(x) q_n(x) \, d\mu(x) = 0 \quad \text{for all } p \in \mathfrak{P}_{\leq n-1}. \]

Recalling that, by Theorem 8.16, $q_n$ has $n$ distinct roots in $[a, b]$, let the nodes
$x_1, \dots, x_n$ be the zeros of $q_n$.

Definition 9.8. The $n$-point Gauss quadrature formula $Q_n$ is the quadrature
formula with nodes (sometimes called Gauss points) $x_1, \dots, x_n$ given by the
zeros of the orthogonal polynomial $q_n$ and weights given in terms of the
Lagrange basis polynomials $\ell_i$ for the nodes $x_1, \dots, x_n$ by

\[ w_i := \int_a^b \ell_i \, d\mu = \int_a^b \prod_{\substack{1 \leq j \leq n \\ j \neq i}} \frac{x - x_j}{x_i - x_j} \, d\mu(x). \tag{9.2} \]

If $p \in \mathfrak{P}_{\leq n-1}$, then $p$ obviously coincides with its Lagrange-form
interpolation on the nodal set $\{ x_1, \dots, x_n \}$, i.e.

\[ p(x) = \sum_{i=1}^n p(x_i) \ell_i(x) \quad \text{for all } x \in \mathbb{R}. \]

Therefore,

\[ \int_a^b p(x) \, d\mu(x) = \int_a^b \sum_{i=1}^n p(x_i) \ell_i(x) \, d\mu(x) = \sum_{i=1}^n p(x_i) w_i =: Q_n(p), \]

and so the $n$-point Gauss quadrature rule is exact for polynomial integrands
of degree at most $n - 1$. However, Gauss quadrature in fact has an optimal
degree of polynomial exactness:

Theorem 9.9. The $n$-point Gauss quadrature formula has order of accuracy
exactly $2n - 1$, and no quadrature formula on $n$ nodes has order of accuracy
higher than this.

Proof. Lemma 9.2 shows that no quadrature formula can have order of
accuracy greater than $2n - 1$.

On the other hand, suppose that $p \in \mathfrak{P}_{\leq 2n-1}$. Factor this polynomial as

\[ p(x) = g(x) q_n(x) + r(x), \]

where $\deg(g) \leq n - 1$, and the remainder $r$ is also a polynomial of degree at
most $n - 1$. Since $q_n$ is orthogonal to all polynomials of degree at most $n - 1$,
$\int_a^b g q_n \, d\mu = 0$. Moreover, since $g(x_j) q_n(x_j) = 0$ for each node $x_j$,

\[ Q_n(g q_n) = \sum_{j=1}^n w_j g(x_j) q_n(x_j) = 0. \]

Since $\int_a^b \cdot \, d\mu$ and $Q_n(\cdot)$ are both linear operators,

\[ \int_a^b p \, d\mu = \int_a^b r \, d\mu \quad \text{and} \quad Q_n(p) = Q_n(r). \]

Since $\deg(r) \leq n - 1$, $\int_a^b r \, d\mu = Q_n(r)$, and so $\int_a^b p \, d\mu = Q_n(p)$, as claimed. □
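
The sharpness of Theorem 9.9 can be observed numerically. The sketch below assumes Lebesgue measure on $[-1, 1]$, for which the Gauss nodes and weights are those of Gauss-Legendre quadrature as returned by NumPy's `leggauss`; the helper name `gauss_legendre_monomial_errors` is illustrative.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss


def gauss_legendre_monomial_errors(n, max_degree):
    """Absolute error of the n-point Gauss-Legendre rule on x**k, k = 0..max_degree."""
    nodes, weights = leggauss(n)
    errors = []
    for k in range(max_degree + 1):
        exact = 0.0 if k % 2 else 2.0 / (k + 1)   # integral of x**k over [-1, 1]
        errors.append(abs(np.sum(weights * nodes ** k) - exact))
    return np.array(errors)


errors = gauss_legendre_monomial_errors(3, 6)
print(errors[:6])   # degrees 0..5 = 2n - 1: errors at rounding level
print(errors[6])    # degree 6 = 2n: a genuine error, as Lemma 9.2 predicts
```

With $n = 3$ nodes, the monomials up to degree 5 are integrated exactly (up to rounding), while $x^6$ is not.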

Recall that the Gauss weights were defined in (9.2) by $w_i := \int_a^b \ell_i \, d\mu$. The
next theorem gives a neat expression for the Gauss weights in terms of the
orthogonal polynomials $\{ q_n \mid n \in \mathcal{N} \}$.

Theorem 9.10. The Gauss weights for a non-negative measure $\mu$ satisfy

\[ w_j = \frac{a_n}{a_{n-1}} \frac{\int_a^b q_{n-1}(x)^2 \, d\mu(x)}{q_n'(x_j) \, q_{n-1}(x_j)}, \tag{9.3} \]

where $a_k$ is the coefficient of $x^k$ in $q_k(x)$.

Proof. First note that

\[ \prod_{\substack{1 \leq j \leq n \\ j \neq i}} (x - x_j) = \frac{1}{x - x_i} \prod_{1 \leq j \leq n} (x - x_j) = \frac{1}{a_n} \frac{q_n(x)}{x - x_i}. \]

Furthermore, taking the limit $x \to x_i$ using l'Hôpital's rule yields

\[ \prod_{\substack{1 \leq j \leq n \\ j \neq i}} (x_i - x_j) = \frac{q_n'(x_i)}{a_n}. \]

Therefore,

\[ w_i = \frac{1}{q_n'(x_i)} \int_a^b \frac{q_n(x)}{x - x_i} \, d\mu(x). \tag{9.4} \]

The remainder of the proof concerns the integral $\int_a^b \frac{q_n(x)}{x - x_i} \, d\mu(x)$.
Observe that

\[ \frac{1}{x - x_i} = \frac{1 - \left( \frac{x}{x_i} \right)^k}{x - x_i} + \left( \frac{x}{x_i} \right)^k \frac{1}{x - x_i}, \tag{9.5} \]

and that the first term on the right-hand side is a polynomial of degree at
most $k - 1$. Hence, upon multiplying both sides of (9.5) by $q_n$ and integrating,
it follows that, for $k \leq n$,

\[ \int_a^b \frac{x^k q_n(x)}{x - x_i} \, d\mu(x) = x_i^k \int_a^b \frac{q_n(x)}{x - x_i} \, d\mu(x). \]

Hence, for any polynomial $p \in \mathfrak{P}_{\leq n}$,

\[ \int_a^b \frac{p(x) q_n(x)}{x - x_i} \, d\mu(x) = p(x_i) \int_a^b \frac{q_n(x)}{x - x_i} \, d\mu(x). \tag{9.6} \]

In particular, for $p = q_{n-1}$, since $\deg\left( \frac{q_n}{x - x_i} \right) \leq n - 1$, write

\[ \frac{q_n(x)}{x - x_i} = a_n x^{n-1} + s(x) \quad \text{for some } s \in \mathfrak{P}_{\leq n-2} \]
\[ \phantom{\frac{q_n(x)}{x - x_i}} = a_n \left( x^{n-1} - \frac{q_{n-1}(x)}{a_{n-1}} \right) + a_n \frac{q_{n-1}(x)}{a_{n-1}} + s(x). \]

Since the first and third terms on the right-hand side are orthogonal to $q_{n-1}$,
(9.6) with $p = q_{n-1}$ implies that

\[ \int_a^b \frac{q_n(x)}{x - x_i} \, d\mu(x) = \frac{1}{q_{n-1}(x_i)} \int_a^b a_n \frac{q_{n-1}(x)}{a_{n-1}} \, q_{n-1}(x) \, d\mu(x) = \frac{a_n}{a_{n-1} q_{n-1}(x_i)} \int_a^b q_{n-1}(x)^2 \, d\mu(x). \]

Substituting this into (9.4) yields (9.3). □
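
As a sanity check on formula (9.3), the following sketch specializes it to the Legendre polynomials $P_n$ for Lebesgue measure on $[-1, 1]$. It assumes the standard facts that the leading coefficients satisfy $a_n / a_{n-1} = (2n - 1)/n$ and that $\int_{-1}^1 P_{n-1}(x)^2 \, dx = 2/(2n - 1)$; the function name `legendre_weights_from_formula` is illustrative, and NumPy's `leggauss` is used only to supply the nodes and reference weights.

```python
import numpy as np
from numpy.polynomial import legendre as L


def legendre_weights_from_formula(n):
    """Gauss-Legendre weights computed from (9.3), for n >= 1."""
    nodes, reference_weights = L.leggauss(n)
    q_n = L.Legendre.basis(n)          # P_n
    q_nm1 = L.Legendre.basis(n - 1)    # P_{n-1}
    ratio = (2.0 * n - 1.0) / n        # a_n / a_{n-1} for Legendre polynomials
    norm_sq = 2.0 / (2.0 * n - 1.0)    # integral of P_{n-1}(x)**2 over [-1, 1]
    weights = ratio * norm_sq / (q_n.deriv()(nodes) * q_nm1(nodes))
    return nodes, weights, reference_weights


nodes, weights, reference_weights = legendre_weights_from_formula(5)
print(np.max(np.abs(weights - reference_weights)))   # agreement up to rounding
print(np.all(weights > 0))                           # the weights are positive
```

In this special case (9.3) reduces to the classical expression $w_j = 2 / \bigl( n \, P_n'(x_j) \, P_{n-1}(x_j) \bigr)$.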

Furthermore: 

Theorem 9.11. For any non-negative measure $\mu$ on $\mathbb{R}$, the Gauss quadrature
weights are positive.

Proof. Fix $1 \leq i \leq n$ and consider the polynomial

\[ p(x) := \prod_{\substack{1 \leq j \leq n \\ j \neq i}} (x - x_j)^2, \]

i.e. the square of the nodal polynomial, divided by $(x - x_i)^2$. Since $\deg(p) <
2n - 1$, the Gauss quadrature formula is exact, and since $p$ vanishes at every
node other than $x_i$, it follows that

\[ \int_a^b p \, d\mu = \sum_{j=1}^n w_j p(x_j) = w_i p(x_i). \]

Since $\mu$ is a non-negative measure, $p \geq 0$ everywhere, and $p(x_i) > 0$, it follows
that $w_i > 0$. □


Finally, we already know that Gauss quadrature on $n$ nodes has the optimal
degree of polynomial accuracy; for not necessarily polynomial integrands,
the following error estimate holds:

Theorem 9.12 (Stoer and Bulirsch, 2002, Theorem 3.6.24). Suppose that
$f \in C^{2n}([a, b]; \mathbb{R})$. Then there exists $\xi \in [a, b]$ such that

\[ \int_a^b f(x) \, d\mu(x) - Q_n(f) = \frac{f^{(2n)}(\xi)}{(2n)!} \| p_n \|_{L^2(\mu)}^2, \]

where $p_n$ is the monic orthogonal polynomial of degree $n$ for $\mu$. In particular,

\[ \left| \int_a^b f(x) \, d\mu(x) - Q_n(f) \right| \leq \frac{\| f^{(2n)} \|_\infty}{(2n)!} \| p_n \|_{L^2(\mu)}^2, \]

and the error is zero if $f$ is a polynomial of degree at most $2n - 1$.


In practice, the accuracy of a Gaussian quadrature for a given integrand $f$
can be estimated by computing $Q_n(f)$ and $Q_m(f)$ for some $m > n$. However,
this can be an expensive proposition, since none of the evaluations of $f$ for
$Q_n(f)$ will be re-used in the calculation of $Q_m(f)$. This deficiency motivates
the development of nested quadrature rules in the next section.


9.3 Clenshaw-Curtis/Fejér Quadrature

Despite its optimal degree of polynomial exactness, Gaussian quadrature
has some major drawbacks in practice. One principal drawback is that, by
Theorem 8.16, the Gaussian quadrature nodes are never nested: that is,
if one wishes to increase the accuracy of the numerical integral by passing
from using, say, $n$ nodes to $2n$ nodes, then none of the first $n$ nodes will be
re-used. If evaluations of the integrand are computationally expensive, then
this lack of nesting is a major concern. Another drawback of Gaussian
quadrature on $n$ nodes is the computational cost of computing the weights, which
is $O(n^2)$ by classical methods such as the Golub-Welsch algorithm, though
there also exist $O(n)$ algorithms that are more expensive than the
Golub-Welsch method for small $n$, but vastly preferable for large $n$. By contrast, the
Clenshaw-Curtis quadrature rules (although in fact discovered thirty years
previously by Fejér) are nested quadrature rules, with accuracy comparable
to Gaussian quadrature in many circumstances, and with weights that can
be computed with cost $O(n \log n)$.

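The Clenshaw-Curtis quadrature formula for the integration of a function
$f \colon [-1, 1] \to \mathbb{R}$ with respect to uniform (Lebesgue) measure on $[-1, 1]$ begins
with a change of variables:

\[ \int_{-1}^1 f(x) \, dx = \int_0^\pi f(\cos \theta) \sin \theta \, d\theta. \]

Now suppose that $f$ has a cosine series

\[ f(\cos \theta) = \frac{a_0}{2} + \sum_{k=1}^\infty a_k \cos(k \theta), \]

where the cosine series coefficients are given by

\[ a_k = \frac{2}{\pi} \int_0^\pi f(\cos \theta) \cos(k \theta) \, d\theta. \]

If so, then

\[ \int_0^\pi f(\cos \theta) \sin \theta \, d\theta = a_0 + \sum_{k=1}^\infty \frac{2 a_{2k}}{1 - (2k)^2}. \]

By the Nyquist-Shannon sampling theorem, for $k < n$, $a_k$ can be computed
exactly by evaluating $f(\cos \theta)$ at $n + 1$ equally spaced nodes $\{ \theta_j = \frac{j \pi}{n} \mid j =
0, \dots, n \}$, where the interior nodes have weight $\frac{1}{n}$ and the endpoints have
weight $\frac{1}{2n}$ (before the overall factor of 2):

\[ a_k = \frac{2}{n} \left( \frac{(-1)^k f(-1)}{2} + \frac{f(1)}{2} + \sum_{j=1}^{n-1} f\left( \cos \frac{j \pi}{n} \right) \cos \frac{k j \pi}{n} \right). \tag{9.7} \]

For $k > n$, formula (9.7) for $a_k$ is false, and falls prey to aliasing error:
sampling $\theta \mapsto \cos(\theta)$ and $\theta \mapsto \cos((n + 1)\theta)$ at $n + 1$ equally spaced nodes
produces identical sequences of sample values even though the functions being
sampled are distinct.¹ Clearly, this choice of nodes has the nesting property
that doubling $n$ produces a new set of nodes containing all of the previous
ones.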

Note that the cosine series expansion of $f$ is also a Chebyshev polynomial
expansion of $f$, since by construction $T_k(\cos \theta) = \cos(k \theta)$:

\[ f(x) = \frac{a_0}{2} T_0(x) + \sum_{k=1}^\infty a_k T_k(x). \tag{9.8} \]

The nodes $x_j = \cos \frac{j \pi}{n}$ are the extrema of the Chebyshev polynomial $T_n$.

In contrast to Gaussian quadrature, which evaluates the integrand at $n + 1$
points and exactly integrates polynomials up to degree $2n + 1$,
Clenshaw-Curtis quadrature evaluates the integrand at $n + 1$ points and exactly
integrates polynomials only up to degree $n$. However, in practice, the fact that
Clenshaw-Curtis quadrature has lower polynomial accuracy is not of great
concern: it has accuracy comparable to Gaussian quadrature for 'most'
integrands (which are ipso facto not polynomials). Heuristically, this may
be attributed to the rapid convergence of the Chebyshev expansion (9.8).
Trefethen (2008) presents numerical evidence that the 'typical' error for both
Gaussian and Clenshaw-Curtis quadrature of an integrand in $C^k$ is of the
order of $\frac{1}{(2n)^k k}$. This comparable level of accuracy, the nesting property of
the Clenshaw-Curtis nodes, and the fact that the weights can be computed
in $O(n \log n)$ time, make Clenshaw-Curtis quadrature an attractive option
for numerical integration.
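
The construction above can be sketched in a few lines. The code below computes the even-indexed cosine coefficients from the discrete sum (9.7) and inserts them into the truncated series for the integral. Two choices here are assumptions of this sketch rather than statements from the text: $n$ is taken to be even, and the coefficient with index $k = n$ is halved, which is the usual convention when the discrete sum is used at the highest resolvable frequency. The function name `clenshaw_curtis` is illustrative, and the coefficients are formed by a direct $O(n^2)$ matrix product rather than by a fast $O(n \log n)$ transform.

```python
import numpy as np


def clenshaw_curtis(f, n):
    """Approximate the integral of f over [-1, 1] using n + 1 nodes cos(j*pi/n).

    f must accept NumPy arrays; n must be an even integer >= 2.
    """
    if n < 2 or n % 2:
        raise ValueError("this sketch expects an even n >= 2")
    theta = np.arange(n + 1) * np.pi / n
    samples = f(np.cos(theta))
    # Halve the two endpoint samples, as in (9.7).
    samples = np.array(samples, dtype=float)
    samples[0] *= 0.5
    samples[-1] *= 0.5
    ks = np.arange(0, n + 1, 2)                        # only even k contribute
    a = (2.0 / n) * (np.cos(np.outer(ks, theta)) @ samples)
    a[-1] *= 0.5                                       # assumed convention for k = n
    return a[0] + np.sum(2.0 * a[1:] / (1.0 - ks[1:] ** 2))


print(clenshaw_curtis(lambda x: x ** 4, 4))            # exact value is 2/5
print(clenshaw_curtis(np.exp, 16) - (np.e - 1.0 / np.e))
```

Because the nodes for $n$ are a subset of the nodes for $2n$, the function values computed at one level can be re-used at the next, although this simple sketch recomputes them each time.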


¹ This is exactly the phenomenon that makes car wheels appear to spin backwards instead
of forwards in movies. The frame rates in common use are $f = 24$, 25 and 30 frames per
second. A wheel spinning at $f$ revolutions per second will appear to be stationary; one
spinning at $f + 1$ revolutions per second (i.e. $1 + \frac{1}{f}$ revolutions per frame) will appear to
be spinning at 1 revolution per second; and one spinning at $f - 1$ revolutions per second
will appear to be spinning in reverse at 1 revolution per second.

9.4 Multivariate Quadrature


Having established quadrature rules for integrals with a one-dimensional
domain of integration, the next agendum is to produce quadrature
formulae for multi-dimensional (i.e. iterated) integrals of the form

\[ \int_{\prod_{j=1}^d [a_j, b_j]} f(x) \, dx = \int_{a_d}^{b_d} \cdots \int_{a_1}^{b_1} f(x_1, \dots, x_d) \, dx_1 \dots dx_d. \]


This kind of multivariate quadrature is also known as cubature. At first sight,
multivariate quadrature does not seem to require mathematical ideas more
sophisticated than univariate quadrature. However, practical applications
often involve high-dimensional domains of integration, which leads to an
exponential growth in the computational cost of quadrature if it is performed
naively. Therefore, it becomes necessary to develop new techniques in order
to circumvent this curse of dimension.

Tensor Product Quadrature Formulae. The first, obvious, strategy to
try is to treat $d$-dimensional integration as a succession of $d$ one-dimensional
integrals and apply our favourite one-dimensional quadrature formula $d$
times. This is the idea underlying tensor product quadrature formulae, and it
has one major flaw: if the one-dimensional quadrature formula uses $n$ nodes,
then the tensor product rule uses $N = n^d$ nodes, which very rapidly leads to
an impractically large number of integrand evaluations for even moderately
large values of $n$ and $d$. In general, when the one-dimensional quadrature
formula uses $n$ nodes, the error for an integrand in $C^r$ using a tensor product
rule is $O(N^{-r/d})$ in terms of the total number of nodes $N = n^d$.
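
The growth of the node count is easy to see in code. The following sketch builds a tensor product rule on $[-1, 1]^d$ from the one-dimensional Gauss-Legendre rule; the choice of Gauss-Legendre as the one-dimensional rule and the function name `tensor_gauss_legendre` are assumptions made for illustration.

```python
import itertools

import numpy as np
from numpy.polynomial.legendre import leggauss


def tensor_gauss_legendre(n, d):
    """Nodes (shape (n**d, d)) and weights (shape (n**d,)) on [-1, 1]^d."""
    x, w = leggauss(n)
    nodes = np.array(list(itertools.product(x, repeat=d)))
    # Each multivariate weight is the product of the univariate weights.
    weights = np.array([np.prod(c) for c in itertools.product(w, repeat=d)])
    return nodes, weights


nodes, weights = tensor_gauss_legendre(4, 3)
print(nodes.shape)                                        # N = n**d = 64 nodes
approx = np.sum(weights * np.prod(nodes ** 2, axis=1))    # integrand x1^2 x2^2 x3^2
print(approx, (2.0 / 3.0) ** 3)
```

Even with only $n = 4$ nodes per coordinate, $d = 10$ would already require $4^{10} \approx 10^6$ integrand evaluations.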

Remark 9.13 (Sobolev spaces for quadrature). The Sobolev embedding
theorem (Morrey's inequality) only gives continuity, and hence well-defined
pointwise values, of functions in $H^s(X)$ when $2s > \dim X$. Therefore, since
pointwise evaluation of integrands is a necessary ingredient of quadrature, the
correct Sobolev spaces for the study of multidimensional quadrature rules are
the spaces $H^s_{\mathrm{mix}}(X)$ of dominating mixed smoothness. Whereas the norm in
$H^s(X)$ is, up to equivalence,

\[ \| u \|_{H^s(X)} = \sum_{\| \alpha \|_1 \leq s} \left\| \frac{\partial^{\| \alpha \|_1} u}{\partial x^\alpha} \right\|_{L^2(X)}, \]

the norm in $H^s_{\mathrm{mix}}(X)$ is, up to equivalence,

\[ \| u \|_{H^s_{\mathrm{mix}}(X)} = \sum_{\| \alpha \|_\infty \leq s} \left\| \frac{\partial^{\| \alpha \|_1} u}{\partial x^\alpha} \right\|_{L^2(X)}. \]

So, for example, in two or more variables, $H^1_{\mathrm{mix}}(X)$ is a space intermediate
between $H^1(X)$ and $H^2(X)$, and is a space in which pointwise evaluation
always makes sense. In particular, functions in $H^s_{\mathrm{mix}}(X)$ enjoy $H^s$ regularity
in every direction individually, and 'using' derivative in one direction does
not 'deplete' the number of derivatives available in any other.

Fig. 9.1: Illustration of the nodes of the 2-dimensional Smolyak sparse
quadrature formulae $Q^{(2)}_\ell$ for levels $\ell = 1, \dots, 6$, in the case that the 1-dimensional
quadrature formula $Q^{(1)}_\ell$ has $2^\ell - 1$ equally spaced nodes in the interior of
$[0, 1]$, i.e. is an open Newton-Cotes formula.

Sparse Quadrature Formulae. The curse of dimension, which quickly
renders tensor product quadrature formulae impractical in high dimension,
spurs the consideration of sparse quadrature formulae, in which far fewer than
$n^d$ nodes are used, at the cost of some accuracy in the quadrature formula: in
practice, we are willing to pay the price of loss of accuracy in order to get any
answer at all! One example of a popular family of sparse quadrature rules
is the recursive construction of Smolyak sparse grids, which is particularly
useful when combined with a nested one-dimensional quadrature rule such as
the Clenshaw-Curtis rule.

Definition 9.14. Suppose that, for each $\ell \in \mathbb{N}$, a one-dimensional
quadrature formula $Q^{(1)}_\ell$ is given. Suppose also that the quadrature rules are nested,
i.e. the nodes for $Q^{(1)}_\ell$ are a subset of those for $Q^{(1)}_{\ell+1}$. The Smolyak
quadrature formula in dimension $d \in \mathbb{N}$ at level $\ell \in \mathbb{N}$ is defined in terms of the
lower-dimensional quadrature formulae by

\[ Q^{(d)}_\ell(f) := \left( \sum_{i=1}^\ell \bigl( Q^{(1)}_i - Q^{(1)}_{i-1} \bigr) \otimes Q^{(d-1)}_{\ell-i+1} \right)(f). \tag{9.9} \]

Formula (9.9) takes a little getting used to, and it helps to first consider
the case $d = 2$ and a few small values of $\ell$, with the convention $Q^{(1)}_0 := 0$.
First, for $\ell = 1$, Smolyak's rule is the quadrature formula
$Q^{(2)}_1 = Q^{(1)}_1 \otimes Q^{(1)}_1$, i.e. the full tensor product of
the one-dimensional quadrature formula $Q^{(1)}_1$ with itself. For the next level,
$\ell = 2$, Smolyak's rule is

\[ Q^{(2)}_2 = \sum_{i=1}^2 \bigl( Q^{(1)}_i - Q^{(1)}_{i-1} \bigr) \otimes Q^{(1)}_{2-i+1} \]
\[ \phantom{Q^{(2)}_2} = Q^{(1)}_1 \otimes Q^{(1)}_2 + \bigl( Q^{(1)}_2 - Q^{(1)}_1 \bigr) \otimes Q^{(1)}_1 \]
\[ \phantom{Q^{(2)}_2} = Q^{(1)}_1 \otimes Q^{(1)}_2 + Q^{(1)}_2 \otimes Q^{(1)}_1 - Q^{(1)}_1 \otimes Q^{(1)}_1. \]

The "$- Q^{(1)}_1 \otimes Q^{(1)}_1$" term is included to avoid double counting. See Figure 9.1
for illustrations of Smolyak sparse grids in two dimensions, using the simple
(although practically undesirable, due to Runge's phenomenon)
Newton-Cotes quadrature rules as the one-dimensional basis for the sparse product.
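
The two-dimensional case of (9.9) can be sketched as follows. The one-dimensional rule at level $\ell$ is taken to be the interpolatory rule on the $2^\ell - 1$ equally spaced interior nodes of $[0, 1]$, as in Figure 9.1, with weights found by solving the moment equations; the convention $Q^{(1)}_0 := 0$, the function names `open_rule` and `smolyak_2d`, and the dictionary representation of the rule are assumptions of this sketch.

```python
from collections import defaultdict

import numpy as np


def open_rule(level):
    """Interpolatory rule on the 2**level - 1 equally spaced interior nodes of [0, 1]."""
    m = 2 ** level - 1
    nodes = np.arange(1, m + 1) / (m + 1)
    vandermonde = np.vander(nodes, m, increasing=True).T   # row k holds nodes**k
    moments = 1.0 / np.arange(1, m + 1)                    # integral of x**k over [0, 1]
    return nodes, np.linalg.solve(vandermonde, moments)


def smolyak_2d(level):
    """Smolyak rule in dimension 2, returned as a dict {(x, y): weight}."""
    rule = defaultdict(float)
    for i in range(1, level + 1):
        ny, wy = open_rule(level - i + 1)
        # (Q_i - Q_{i-1}) tensor Q_{level-i+1}, with Q_0 := 0 (assumed convention).
        for sign, lev in ((1.0, i), (-1.0, i - 1)):
            if lev == 0:
                continue
            nx, wx = open_rule(lev)
            for x, a in zip(nx, wx):
                for y, b in zip(ny, wy):
                    rule[(float(x), float(y))] += sign * a * b
    return dict(rule)


rule = smolyak_2d(2)
print(len(rule))                    # 5 nodes, compared with 9 for the full tensor grid
print(min(rule.values()))           # one node carries a negative weight
print(sum(w * x * x for (x, y), w in rule.items()))   # integrates x1**2 exactly: 1/3
```

The node keys coincide exactly across levels because the nodes are dyadic rationals, which is what allows the contributions of the different tensor products to be accumulated in a single dictionary.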

In general, when the one-dimensional quadrature formula at level $\ell$ uses $n_\ell$
nodes, the quadrature error for an integrand in $C^r$ using Smolyak recursion is

\[ \left| \int f(x) \, dx - Q(f) \right| = O\bigl( n_\ell^{-r} (\log n_\ell)^{(d-1)(r+1)} \bigr). \tag{9.10} \]

In practice, one needs a lot of smoothness for the integrand $f$, or many sample
points, to obtain a numerical integral for $f$ that is accurate to within $0 < \varepsilon \ll
1$: the necessary number of function evaluations grows like $d^{-c \log \varepsilon}$ for $c > 0$.
Note also that the Smolyak quadrature rule includes nodes with negative
weights, and so it can fall prey to the problems outlined in Exercise 9.1.

Remark 9.15 (Sparse quadratures as reduced bases). As indicated above in 
the discussion preceding Definition 9.5, there is a deep connection between 
quadrature and interpolation theory. The sparse quadrature rules of Smolyak 
and others can be interpreted in the interpolation context as the deletion 
of certain cross terms to form a reduced interpolation basis. For example, 
consider the Smolyak-Newton-Cotes nodes in the square $[-1,1] \times [-1,1]$ as
illustrated in Figure 9.1. 

(a) At level $\ell = 1$, the only polynomial functions in the two variables
$x_1$ and $x_2$ that can be reconstructed exactly by interpolation of their
values at the unique node are the constant functions. Thus, the
interpolation basis at level $\ell = 1$ is just $\{1\}$ and the interpolation
space is $\mathfrak{P}_{\leq 0}$.
(b) At level $\ell = 2$, the three nodes in the first coordinate direction
allow perfect reconstruction of quadratic polynomials in $x_1$ alone;
similarly, quadratic polynomials in $x_2$ alone can be reconstructed.
However, it is not true that every quadratic polynomial in $x_1$ and $x_2$
can be reconstructed from its values on the sparse nodes:
$p(x_1, x_2) = x_1 x_2$ is a non-trivial quadratic that vanishes on the
nodes. Thus, the interpolation basis at level $\ell = 2$ is

\[ \{ 1, x_1, x_2, x_1^2, x_2^2 \}, \]

and so the corresponding interpolation space is a proper subspace of
$\mathfrak{P}_{\leq 2}$. In contrast, the tensor product of two 1-dimensional
3-point quadrature rules corresponds to the full interpolation space
$\mathfrak{P}_{\leq 2}$.
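
The level-2 construction and the loss of cross terms can be checked by hand
on a small case. The following is a minimal sketch, not a general
implementation: it assumes that the nested one-dimensional rules on $[-1, 1]$
are the midpoint rule (level 1) and Simpson’s rule (level 2), and that
$Q^{(1)}_{0} := 0$; the names `RULES_1D`, `difference`, `smolyak_2d` and
`apply_rule` are illustrative only.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed nested one-dimensional rules on [-1, 1], stored as {node: weight}.
# Level 0 is the zero rule, level 1 the midpoint rule, level 2 Simpson's rule.
RULES_1D = {
    0: {},
    1: {Fraction(0): Fraction(2)},
    2: {
        Fraction(-1): Fraction(1, 3),
        Fraction(0): Fraction(4, 3),
        Fraction(1): Fraction(1, 3),
    },
}


def difference(i):
    """Nodes and weights of the one-dimensional difference Q_i - Q_{i-1}."""
    out = defaultdict(Fraction)
    for x, w in RULES_1D[i].items():
        out[x] += w
    for x, w in RULES_1D[i - 1].items():
        out[x] -= w
    return out


def smolyak_2d(level):
    """Nodes and weights of the two-dimensional Smolyak rule, as in (9.9)."""
    grid = defaultdict(Fraction)
    for i in range(1, level + 1):
        for x1, w1 in difference(i).items():
            for x2, w2 in RULES_1D[level - i + 1].items():
                grid[(x1, x2)] += w1 * w2
    return {node: w for node, w in grid.items() if w != 0}


def apply_rule(grid, f):
    """Evaluate the quadrature rule on the integrand f(x1, x2)."""
    return sum(w * f(x1, x2) for (x1, x2), w in grid.items())


grid = smolyak_2d(2)
# Pure quadratics are integrated exactly; the cross term x1^2 * x2^2 is not.
print(apply_rule(grid, lambda x1, x2: x1 * x1))
print(apply_rule(grid, lambda x1, x2: x1 * x1 * x2 * x2))
```

Exact rational arithmetic is used so that the comparison with the true
integrals ($4/3$ for $x_1^2$ and $4/9$ for $x_1^2 x_2^2$ over the square) is
not blurred by rounding.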


9.5 Monte Carlo Methods 


As seen above, tensor product quadrature formulae suffer from the curse of 
dimensionality: they require exponentially many evaluations of the integrand 
as a function of the dimension of the integration domain. Sparse grid con- 
structions only partially alleviate this problem. Remarkably, however, the 
curse of dimensionality can be entirely circumvented by resorting to random 
sampling of the integration domain — provided, of course, that it is possi- 
ble to draw samples from the measure against which the integrand is to be 
integrated. 

Monte Carlo methods are, in essence, an application of the Law of Large 
Numbers (LLN). Recall that the LLN states that if $Y^{(1)}, Y^{(2)}, \dots$
are independently and identically distributed according to the law of a
random variable $Y$ with finite expectation $\mathbb{E}[Y]$, then the sample
average

\[ S_n := \frac{1}{n} \sum_{i=1}^{n} Y^{(i)} \]

converges in some sense to $\mathbb{E}[Y]$ as $n \to \infty$. The weak LLN
states that the mode of convergence is convergence in probability:

\[ \text{for all } \varepsilon > 0, \quad \lim_{n \to \infty} \mathbb{P}\left[ \left| \frac{1}{n} \sum_{i=1}^{n} Y^{(i)} - \mathbb{E}[Y] \right| > \varepsilon \right] = 0; \]

whereas the strong LLN states that the mode of convergence is actually
almost sure:

\[ \mathbb{P}\left[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} Y^{(i)} = \mathbb{E}[Y] \right] = 1. \]

The LLN is further generalized by the Birkhoff-Khinchin ergodic theorem,
and indeed ergodicity properties are fundamental to more advanced variants
of Monte Carlo methods such as Markov chain Monte Carlo.

Remark 9.16. The assumption that the expected value exists and is finite is
essential. If this assumption fails, then Monte Carlo estimates can give
apparently plausible but ‘infinitely wrong’ results; in particular, a ‘lucky’
Monte Carlo run may appear to converge to some value and mislead a
practitioner into believing that the expected value of $Y$ has indeed been
found.

For example, suppose that $X \sim \gamma = \mathcal{N}(0, 1)$ is a standard
normal random variable, and let $a \in \mathbb{R}$. Now take

\[ Y := \frac{1}{a - X}. \]

Note that $Y$ is $\gamma$-almost surely finite. However,
$\mathbb{E}_{\gamma}[Y]$ is undefined: if $\mathbb{E}_{\gamma}[Y]$ did exist,
then $x \mapsto |a - x|^{-1}$ would have to be $\gamma$-integrable, and indeed
it would have to be integrable with respect to Lebesgue measure on some
neighbourhood of $a$, which it is not.

It is interesting to observe, as illustrated in Figure 9.2, that for small
values of $a$, Monte Carlo estimates of $\mathbb{E}[Y]$ are obviously poorly
behaved; seeing these, one would not be surprised to learn that
$\mathbb{E}_{\gamma}[Y]$ does not exist. However, for $|a| \gg 1$, the Monte
Carlo average appears (but only appears) to converge to $a^{-1}$, even though
$\mathbb{E}_{\gamma}[Y]$ still does not exist. There is, in fact, no
qualitative difference between the two cases illustrated in Figure 9.2. That
the Monte Carlo average cannot, in fact, converge to $a^{-1}$ follows from the
following result, which should be seen as a result in the same vein as
Kolmogorov’s zero-one law (Theorem 2.37):


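Theorem 9.17 (Kesten, 1970). Let $Y \sim \mu$ be a real-valued random variable
for which $\mathbb{E}_{\mu}[Y]$ is undefined, i.e.
$\mathbb{E}_{\mu}[\max\{0, Y\}] = \mathbb{E}_{\mu}[\max\{0, -Y\}] = +\infty$.
Let $(Y^{(i)})_{i \in \mathbb{N}}$ be a sequence of i.i.d. draws from $\mu$,
and let $S_n := \frac{1}{n} \sum_{i=1}^{n} Y^{(i)}$. Then exactly one of the
following holds true:

(a) $\mathbb{P}_{\mu}\bigl[ \lim_{n \to \infty} S_n = +\infty \bigr] = 1$;
(b) $\mathbb{P}_{\mu}\bigl[ \lim_{n \to \infty} S_n = -\infty \bigr] = 1$;
(c) $\mathbb{P}_{\mu}\bigl[ \liminf_{n \to \infty} S_n = -\infty \text{ and } \limsup_{n \to \infty} S_n = +\infty \bigr] = 1$.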
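
The behaviour described in Remark 9.16 can be explored numerically. The
sketch below is only an illustration under stated assumptions: it uses
Python’s standard pseudo-random generator, and the function name
`running_means`, the seed and the sample size are arbitrary choices. It
records the running averages $S_n$ of $Y = 1/(a - X)$, which can then be
inspected or plotted for different values of $a$.

```python
import random


def running_means(a, n, seed):
    """Running Monte Carlo averages S_1, ..., S_n of Y = 1 / (a - X), X ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    means = []
    for i in range(1, n + 1):
        x = rng.gauss(0.0, 1.0)
        total += 1.0 / (a - x)  # finite with probability one, but not integrable
        means.append(total / i)
    return means


# For large |a| the averages hover near 1/a over any run of practical length,
# even though Theorem 9.17 rules out genuine convergence.
for a in (2.0, 8.0):
    print(a, running_means(a, 10_000, seed=0)[-1])
```

A run of this length says nothing about the limit: for $a = 8$ a sample
landing close to the singularity is so rare that it will essentially never
be observed, which is exactly why the apparent convergence is misleading.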


‘Vanilla’ Monte Carlo. The simplest formulation of Monte Carlo integration
applies the LLN to the random variable $Y = f(X)$, where $f$ is the function
to be integrated with respect to a probability measure $\mu$ and $X$ is
distributed according to $\mu$. Assuming that one can generate independent
and identically distributed samples $X^{(1)}, X^{(2)}, \dots$ from the
probability measure $\mu$, the $n$th Monte Carlo approximation is

\[ \mathbb{E}_{X \sim \mu}[f(X)] \approx S_n(f) := \frac{1}{n} \sum_{i=1}^{n} f(X^{(i)}). \]
Fig. 9.2: ‘Convergence’ of $S_n := \frac{1}{n} \sum_{i=1}^{n} (a + X^{(i)})^{-1}$
to $a^{-1}$ when $X^{(i)} \sim \mathcal{N}(0, 1)$ are i.i.d.
$\mathbb{E}[(a + X^{(1)})^{-1}]$ is undefined for every $a \in \mathbb{R}$,
which is easily guessed from the plot for $a = 2$, but less easily guessed
from the apparent convergence of $S_n$ to $a^{-1}$ when $a = 8$. Each figure
shows 10 independent Monte Carlo runs.


To obtain an error estimate for such Monte Carlo integrals, we simply
apply Chebyshev’s inequality to $S_n(f)$, which has expected value
$\mathbb{E}[S_n(f)] = \mathbb{E}[f(X)]$ and variance

\[ \mathbb{V}[S_n(f)] = \frac{1}{n^2} \sum_{i=1}^{n} \mathbb{V}[f(X^{(i)})] = \frac{\mathbb{V}[f(X)]}{n}, \]

to obtain that, for any $t > 0$,

\[ \mathbb{P}\bigl[ |S_n(f) - \mathbb{E}[f(X)]| \geq t \bigr] \leq \frac{\mathbb{V}[f(X)]}{n t^2}. \]

That is, for any $\varepsilon \in (0, 1]$, with probability at least
$1 - \varepsilon$ with respect to the $n$ Monte Carlo samples, the Monte Carlo
average $S_n(f)$ lies within $(\mathbb{V}[f(X)] / n \varepsilon)^{1/2}$ of the
true expected value $\mathbb{E}[f(X)]$. Thus, for a fixed integrand $f$, the
error decays like $n^{-1/2}$ regardless of the dimension of the domain of
integration, and this is a major advantage of Monte Carlo integration: as a
function of the number of samples, $n$, the Monte Carlo error is not
something dimension- or smoothness-dependent, like the tensor product
quadrature rule’s error of $O(n^{-r/d})$. However, the slowness of the $n^{-1/2}$
decay rate is a major limitation of ‘vanilla’ Monte Carlo methods; it is unde- 
sirable to have to quadruple the number of samples to double the accuracy 
of the approximate integral. That said, part of the reason why the above 
error bound is so ‘bad’ is that it only uses variance information; much better 
bounds are available for bounded random variables, q.v. Hoeffding’s inequal- 
ity (Corollary 10.13). 
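
The estimator $S_n(f)$ and the Chebyshev error radius derived above can be
sketched in a few lines. This is an illustration only: the integrand
$f(x) = x^2$ with $X \sim \mathrm{Unif}([0, 1])$ is an assumed test case (for
which $\mathbb{E}[f(X)] = 1/3$ and $\mathbb{V}[f(X)] = 4/45$), and the names
`monte_carlo` and `chebyshev_radius` are hypothetical.

```python
import math
import random


def monte_carlo(f, sampler, n, rng):
    """Vanilla Monte Carlo average S_n(f) using n i.i.d. draws from sampler."""
    return sum(f(sampler(rng)) for _ in range(n)) / n


def chebyshev_radius(variance, n, eps):
    """Radius (V[f(X)] / (n * eps))**0.5 that holds with probability >= 1 - eps."""
    return math.sqrt(variance / (n * eps))


rng = random.Random(1)
n = 10_000
estimate = monte_carlo(lambda x: x * x, lambda r: r.random(), n, rng)
radius = chebyshev_radius(4.0 / 45.0, n, eps=0.01)
print(estimate, radius)
# Quadrupling n halves the radius, reflecting the n**(-1/2) rate.
print(chebyshev_radius(4.0 / 45.0, 4 * n, eps=0.01))
```

In practice the variance is rarely known in advance and would itself have to
be estimated from the samples; the exact value is used here only to keep the
sketch short.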

One obvious omission in the above presentation of Monte Carlo integration
is the accessibility of the measure of integration $\mu$. We now survey a few
of the many approaches to this problem.

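Re-Weighting of Samples. In the case that we wish to evaluate an expected
value $\mathbb{E}_{\mu}[f]$ for some integrand $f$ against $\mu$, but can only
easily draw samples from some other measure $\nu$, one approach is to
re-weight the samples of $\nu$: if the density
$\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$ exists and is computationally
accessible, then we can estimate $\mathbb{E}_{\mu}[f]$ via

\[ \mathbb{E}_{\mu}[f] = \mathbb{E}_{\nu}\left[ f \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \right] \approx \frac{1}{n} \sum_{i=1}^{n} f(X^{(i)}) \frac{\mathrm{d}\mu}{\mathrm{d}\nu}(X^{(i)}), \]

where $X^{(1)}, \dots, X^{(n)}$ are independent and identically
$\nu$-distributed. Sometimes, the density
$\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$ is only known up to a normalization
constant, i.e. $\frac{\mathrm{d}\mu}{\mathrm{d}\nu} \propto \rho$, in which
case we use the estimate

\[ \mathbb{E}_{\mu}[f] = \frac{\mathbb{E}_{\nu}[f \rho]}{\mathbb{E}_{\nu}[\rho]} \approx \frac{\sum_{i=1}^{n} f(X^{(i)}) \rho(X^{(i)})}{\sum_{i=1}^{n} \rho(X^{(i)})}. \]

A prime example of this situation is integration with respect to a Bayesian
posterior in the sense of Chapter 6, which is easily expressed in terms of its
non-normalized density with respect to the prior. Note, though, that while
this approach yields convergent estimates for expected values of integrals
against $\mu$, it does not yield $\mu$-distributed samples.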
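
The self-normalized estimate above translates directly into code. The
following sketch assumes, purely for illustration, that $\nu$ is the uniform
measure on $[0, 1]$ and that $\mu$ has non-normalized density $\rho(x) = x$
with respect to $\nu$, so that $\mathbb{E}_{\mu}[f] = 2/3$ for $f(x) = x$; the
function name `self_normalized_estimate` is hypothetical.

```python
import random


def self_normalized_estimate(f, rho, sampler, n, rng):
    """Estimate E_mu[f] from nu-samples when d(mu)/d(nu) is proportional to rho."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(n):
        x = sampler(rng)   # draw from nu
        weight = rho(x)    # non-normalized density of mu with respect to nu
        numerator += f(x) * weight
        denominator += weight
    return numerator / denominator


rng = random.Random(0)
estimate = self_normalized_estimate(
    f=lambda x: x,
    rho=lambda x: x,
    sampler=lambda r: r.random(),
    n=20_000,
    rng=rng,
)
print(estimate)
```

Because the same weights appear in the numerator and the denominator, any
constant factor in `rho` cancels, which is why the normalization constant
never needs to be computed.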
CDF Inversion. If $\mu$ is a measure on $\mathbb{R}$ with cumulative
distribution function

\[ F(x) := \int_{(-\infty, x]} \mathrm{d}\mu, \]

and, moreover, the inverse cumulative distribution function $F^{-1}$ is
computationally accessible, then samples from $\mu$ can be generated using the
implication

\[ U \sim \mathrm{Unif}([0, 1]) \implies F^{-1}(U) \sim \mu. \tag{9.11} \]

Similar transformations can be used to convert samples from other ‘standard’
distributions (e.g. the Gaussian measure $\mathcal{N}(0, 1)$ on $\mathbb{R}$)
into samples from related distributions (e.g. the Gaussian measure
$\mathcal{N}(m, C)$ on $\mathbb{R}^d$). However, in general, such explicit
transformations are not available; often, $\mu$ is a complicated distribution
on a high-dimensional space. One method for (approximately) sampling from
such distributions, when a density function is known, is Markov chain Monte
Carlo.
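
As a concrete instance of (9.11), the sketch below samples an exponential
distribution, for which $F(x) = 1 - e^{-\lambda x}$ and
$F^{-1}(u) = -\log(1 - u) / \lambda$. The choice of distribution and the
function names are assumptions made for illustration.

```python
import math
import random


def exponential_inverse_cdf(u, rate):
    """Inverse CDF of the exponential distribution with the given rate."""
    return -math.log(1.0 - u) / rate


def sample_exponential(n, rate, rng):
    """Draw n samples by pushing uniform draws through the inverse CDF."""
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and the log is finite.
    return [exponential_inverse_cdf(rng.random(), rate) for _ in range(n)]


samples = sample_exponential(20_000, rate=2.0, rng=random.Random(0))
print(sum(samples) / len(samples))  # should be close to the mean 1 / rate
```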


Markov Chain Monte Carlo. Markov chain Monte Carlo (MCMC) methods are a class
of algorithms for sampling from a probability distribution $\mu$ based on
constructing a Markov chain that has $\mu$ as its equilibrium distribution.
The state of the chain after a large number of steps is then used as a
sample of $\mu$. The quality of the sample improves as a function of the
number of steps. Usually it is not hard to construct a Markov chain with the
desired properties; the more difficult problem is to determine how many steps
are needed to converge to $\mu$ within an acceptable error.

More formally, suppose that $\mu$ is a measure on $\mathbb{R}^d$ that has
Lebesgue density proportional to a known function $\rho$, and although
$\rho(x)$ can be evaluated for any $x \in \mathbb{R}^d$, drawing samples from
$\mu$ is difficult. Suppose that, for each $x \in \mathbb{R}^d$,
$q(\,\cdot\,|x)$ is a probability density on $\mathbb{R}^d$ that can be both
easily evaluated and sampled. The Metropolis-Hastings algorithm is to pick an
initial state $x_0 \in \mathbb{R}^d$ and then iteratively construct $x_{n+1}$
from $x_n$ in the following manner:

(a) draw a proposal state $x'$ from $q(\,\cdot\,|x_n)$;

(b) calculate the acceptance probability $a := \min\{1, r\}$, where $r$ is the
acceptance ratio

\[ r := \frac{\rho(x')}{\rho(x_n)} \frac{q(x_n | x')}{q(x' | x_n)}; \tag{9.12} \]

(c) let $u$ be a sample from the uniform distribution on $[0, 1]$;

(d) set

\[ x_{n+1} := \begin{cases} x', & \text{(‘accept’) if } u \leq a, \\ x_n, & \text{(‘reject’) if } u > a. \end{cases} \tag{9.13} \]

In the simplest case that $q$ is symmetric, (9.12) reduces to
$\rho(x') / \rho(x_n)$, and so, on a heuristic level, the accept-or-reject
step (9.13) drives the Markov chain $(x_n)_{n \in \mathbb{N}}$ towards regions
of high $\mu$-probability.
It can be shown, under suitable technical assumptions on $\rho$ and $q$, that
the random sequence $(x_n)_{n \in \mathbb{N}}$ in $\mathbb{R}^d$ has $\mu$ as
its stationary distribution, i.e. for large enough $n$, $x_n$ is
approximately $\mu$-distributed; furthermore, for sufficiently large $n$ and
$m$, $x_n, x_{n+m}, x_{n+2m}, \dots$ are approximately independent
$\mu$-distributed samples. There are, however, always some correlations
between successive samples.

Remark 9.18. Note that, when the proposal and target distributions coincide,
the acceptance ratio (9.12) equals one. Note also that, since only ratios
of the proposal and target densities appear in (9.12), it is sufficient to
know $q$ and $\rho$ up to an arbitrary multiplicative normalization factor.

Example 9.19. To illustrate the second observation of Remark 9.18, consider
the problem of sampling the uniform distribution $\mathrm{Unif}(E)$ on a
subset $E \subseteq \mathbb{R}^d$. The Lebesgue density of this measure is

\[ \frac{\mathrm{d}\,\mathrm{Unif}(E)}{\mathrm{d}x}(x) = \begin{cases} 1 / \mathrm{vol}(E), & \text{if } x \in E, \\ 0, & \text{if } x \notin E, \end{cases} \]

which is difficult to access, since $E$ may have a sufficiently complicated
geometry that its volume $\mathrm{vol}(E)$ is difficult to calculate. However,
by Remark 9.18, we can use the MCMC approach with the non-normalized density

\[ \rho_E(x) := \mathbb{I}_E(x) := \begin{cases} 1, & \text{if } x \in E, \\ 0, & \text{if } x \notin E, \end{cases} \]

in order to sample $\mathrm{Unif}(E)$.

Figure 9.3 shows the results of applying this method to sample the uniform
distribution on the square $S := [-1, 1]^2$ (for which, of course, many
simpler sampling strategies exist) and to sample the uniform distribution on
the crescent-shaped region

\[ E := \{ (x, y) \in \mathbb{R}^2 \mid 1 - x^2 < y < 2 - 2x^2 \}. \]

In each case, after an initial burn-in period of one million steps, every
$m$th MCMC sample was taken as an approximate draw from $\mathrm{Unif}(S)$ or
$\mathrm{Unif}(E)$,
with a stride of m = 200. The proposal distribution was x' ~ A f(x n , |). The 
approximate draws from $\mathrm{Unif}(S)$ and $\mathrm{Unif}(E)$ are shown in
Figure 9.3(b) and (c) respectively. Direct draws from the standard uniform
distribution on $S$, using an off-the-shelf random number generator, are
shown for comparison in Figure 9.3(a). To give an idea of the approximate
uniformity of the draws, the absolute Pearson correlation coefficient of any
two components of Figure 9.3(a) and (b) is at most 0.02, as is the
correlation of successive draws.

Note that simply rescaling the $y$-coordinates of samples from
$\mathrm{Unif}(S)$ would not yield samples from $\mathrm{Unif}(E)$, since
samples transformed in this way would cluster near the end-points of the
crescent; the samples illustrated in Figure 9.3(c) show no such clustering.
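
The Metropolis-Hastings iteration with an indicator-function density can be
sketched as follows for the crescent-shaped region of Example 9.19. This is
a minimal illustration, not the code behind Figure 9.3: the proposal standard
deviation of 0.5, the starting point and the short chain length are
assumptions, no burn-in or thinning is applied, and the function names are
hypothetical.

```python
import random


def in_crescent(x, y):
    """Indicator of the region 1 - x^2 < y < 2 - 2 x^2."""
    return 1.0 - x * x < y < 2.0 - 2.0 * x * x


def crescent_density(point):
    """Non-normalized density of Unif(E): the indicator function of E."""
    return 1.0 if in_crescent(*point) else 0.0


def metropolis_hastings(rho, x0, n_steps, step, rng):
    """Random-walk Metropolis-Hastings chain with a symmetric Gaussian proposal."""
    state = x0  # must satisfy rho(x0) > 0
    chain = []
    for _ in range(n_steps):
        proposal = tuple(c + rng.gauss(0.0, step) for c in state)
        r = rho(proposal) / rho(state)  # q is symmetric, so it cancels in the ratio
        # Strict inequality: a proposal with zero density is never accepted.
        if rng.random() < min(1.0, r):
            state = proposal
        chain.append(state)
    return chain


chain = metropolis_hastings(crescent_density, (0.0, 1.5), 5_000, 0.5, random.Random(0))
print(len(set(chain)), "distinct states visited")
```

For an indicator density the acceptance ratio is either 0 or 1, so the chain
simply stays put whenever the proposal falls outside the region; the volume
of the region is never needed.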
Fig. 9.3: Samples of uniform distributions generated using
Metropolis-Hastings Markov chain Monte Carlo, as in Example 9.19. In (a),
$N = 1000$ draws from the uniform distribution on the square $[-1, 1]^2$. In
(b), $N$ draws from the uniform distribution on $[-1, 1]^2$ generated using MH
MCMC. In (c), $N$ draws from the uniform distribution on a crescent-shaped
region. To give an additional indication of the approximately uniform
distribution of the points in space and in ‘time’ $n$, in each figure they are
shaded on a linear greyscale from white (at $n = 1$) to black (at $n = N$), a
colouring convention that is compatible with the unsampled background also
being white.


There are many variations on the basic Metropolis-Hastings scheme that 
try to improve the rate of convergence, decrease correlations, or allow more 
efficient exploration of distributions $\mu$ with ‘nasty’ features, e.g. being
multi-modal, or concentrated on or near a low-dimensional submanifold of
$\mathbb{R}^d$.

For example, the HMC approach (‘HMC’ originally stood for hybrid Monte
Carlo, but Hamiltonian Monte Carlo is also used, and is more descriptive)
uses gradient-based information and Hamiltonian dynamics to produce the
proposals for the Metropolis algorithm; this method allows the generation of
larger jumps $x' - x_n$ that still have large acceptance probability $a$,
thereby reducing the correlation between successive states, and can also
target new states with a higher acceptance probability than the usual
Metropolis-Hastings algorithm. The reversible-jump MCMC method allows
exploration of probability distributions on spaces whose dimension varies
during the course of the algorithm: this approach is appropriate when ‘the
number of things you don’t know is one of the things you don’t know’, and
when used judiciously it also promotes sparsity in the solution (i.e.
parsimony of explanation).

Multi-Level Monte Carlo. A situation that often arises in problems where 
the integrand $f$ is associated with the solution of some ODE or PDE is that
one has a choice about how accurately to numerically (i.e. approximately) 
solve that differential equation, e.g. through a choice of time step or spa- 
tial mesh size. Of course, a more accurate solution is more computationally 
costly to obtain, especially for PDEs. However, for Monte Carlo methods, 
this problem is actually an opportunity in disguise. 

Suppose that we wish to calculate $\mathbb{E}_{\mu}[f]$, but have at our
disposal a hierarchy $f_0, f_1, \dots, f_L$ of approximations to $f$, indexed
by a level parameter $\ell$; as mentioned above, the level typically
corresponds to a choice of time step or mesh size in an ODE or PDE solver.
Assume that $f = f_L$; one should think of $f_0$ as a coarse model for $f$,
$f_1$ as a better model, and so on. By the linearity property of the
expectation,

\[ \mathbb{E}_{\mu}[f] \equiv \mathbb{E}_{\mu}[f_L] = \mathbb{E}_{\mu}[f_0] + \sum_{\ell=1}^{L} \mathbb{E}_{\mu}[f_{\ell} - f_{\ell-1}]. \]

Each of the summands can be estimated independently using Monte Carlo:

\[ \mathbb{E}_{\mu}[f] \approx \frac{1}{n_0} \sum_{i=1}^{n_0} f_0(X^{(i)}) + \sum_{\ell=1}^{L} \frac{1}{n_{\ell}} \sum_{i=1}^{n_{\ell}} \bigl( f_{\ell}(X^{(i)}) - f_{\ell-1}(X^{(i)}) \bigr). \tag{9.14} \]


On the face of it, there appears to be no advantage to this decomposition
of the Monte Carlo estimator, but this misses two important factors: the
computational cost of evaluating $f_{\ell}(X^{(i)})$ is much lower for lower
values of $\ell$, and the error of the Monte Carlo estimate for the $\ell$th
summand scales like $\sqrt{\mathbb{V}_{\mu}[f_{\ell} - f_{\ell-1}] / n_{\ell}}$.
Therefore, if the ‘correction terms’ between successive fidelity levels are
of low variance, then smaller sample sizes $n_{\ell}$ can be used for higher
values of $\ell$, where each evaluation is most expensive. The MLMC estimator
(9.14) is prototypical of a family of variance reduction methods for
improving the performance of the naive Monte Carlo estimator.

One practical complication that must be addressed is that the domains of
the functions $f_{\ell}$ and $f_{\ell-1}$ are usually distinct. Therefore,
strictly speaking, the MLMC estimator (9.14) does not make sense as written.
A more accurate version of (9.14) would be

\[ \mathbb{E}_{\mu}[f] \approx \frac{1}{n_0} \sum_{i=1}^{n_0} f_0(X^{(0,i)}) + \sum_{\ell=1}^{L} \frac{1}{n_{\ell}} \sum_{i=1}^{n_{\ell}} \bigl( f_{\ell}(X^{(\ell,i)}) - f_{\ell-1}(X^{(\ell,i)}) \bigr), \]

where the $X^{(\ell,i)}$ are i.i.d. draws of the projection of the law of $X$
onto the domain of $f_{\ell}$, and we further assume that $X^{(\ell,i)}$ can
be ‘coarsened’ to be a valid input for $f_{\ell-1}$ as well. (Equally well, we
could take valid inputs for $f_{\ell-1}$ and assume that they can be ‘refined’
to become valid inputs for $f_{\ell}$.) Naturally, these complications make
necessary a careful convergence analysis for the MLMC estimator.
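
The telescoping estimator (9.14) can be sketched on a toy problem in which
all levels share the same input space, so that no coarsening is needed. The
setting is an assumption chosen for illustration: $X \sim \mathrm{Unif}([0, 1])$,
$f(X) = u(1)$ where $u' = -X u$ and $u(0) = 1$, and $f_{\ell}$ is the explicit
Euler approximation with $2^{\ell}$ steps. The function names and the sample
sizes are hypothetical.

```python
import random


def f_level(level, x):
    """Explicit Euler approximation of u(1) for u' = -x u, u(0) = 1, with 2**level steps."""
    steps = 2 ** level
    return (1.0 - x / steps) ** steps


def mlmc_estimate(max_level, sample_sizes, rng):
    """Multi-level Monte Carlo estimate of E[f_L(X)] with X ~ Unif([0, 1])."""
    total = 0.0
    for level in range(max_level + 1):
        n = sample_sizes[level]
        acc = 0.0
        for _ in range(n):
            x = rng.random()
            fine = f_level(level, x)
            # The same sample x is fed to both levels, so the correction term
            # f_l(x) - f_{l-1}(x) has small variance.
            coarse = f_level(level - 1, x) if level > 0 else 0.0
            acc += fine - coarse
        total += acc / n
    return total


# Many cheap coarse samples, few expensive fine ones.
sizes = [4000, 2000, 1000, 500, 250, 125]
print(mlmc_estimate(5, sizes, random.Random(0)))
```

Independent samples are drawn at each level, but within a level the fine and
coarse models are evaluated on the same input; without this coupling the
variance of the correction terms would not be small.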


9.6 Pseudo-Random Methods 


This chapter concludes with a brief survey of numerical integration methods 
that are in fact based upon deterministic sampling, but in such a way as the 
sample points ‘might as well be’ random. To motivate this discussion, observe 
that all the numerical integration schemes (over, say, $[0, 1]^d$ with respect
to uniform measure) that have been considered in this chapter are of the
form

\[ \int_{[0,1]^d} f(x) \, \mathrm{d}x \approx \sum_{i=1}^{n} w_i f(x_i) \]

for some sequence of nodes $x_i \in [0, 1]^d$ and weights $w_i$; for example,
Monte Carlo integration takes the nodes to be independent uniformly
distributed samples, and $w_i \equiv \frac{1}{n}$. By the Koksma-Hlawka
inequality (Theorem 9.23 below), the difference between the exact value of
the integral and the result of the numerical quadrature is bounded above by
the product of two terms: one term is a measure of the smoothness of $f$,
independent of the nodes; the other term is the discrepancy of the nodal set,
which can be thought of as measuring how non-uniformly-distributed the nodes
are.

As noted above in the comparison between Gaussian and Clenshaw-Curtis
quadrature, it is convenient for the nodal set $\{x_1, \dots, x_n\}$ to have
the property that it can be extended to a larger nodal set (e.g. with $n + 1$
or $2n$ points) without having to discard the original $n$ nodes and their
associated evaluations of the integrand $f$. The Monte Carlo approach clearly
has this extensibility property, but has a slow rate of convergence that is
independent both of the spatial dimension and the smoothness of $f$.
Deterministic quadratures may or may not be extensible, and have a
convergence rate better than
that of Monte Carlo, but fall prey to the curse of dimension. What is desired 
is an integration scheme that somehow combines all the desirable features. 
One attempt to do so is a quasi-Monte Carlo method in which the nodes are 
drawn from a sequence with low discrepancy. 


Definition 9.20. The discrepancy of a finite set of points
$P \subseteq [0, 1]^d$ is defined by

\[ D(P) := \sup_{B \in \mathcal{J}} \left| \frac{\#(P \cap B)}{\#P} - \lambda^d(B) \right|, \]

where $\lambda^d$ denotes $d$-dimensional Lebesgue measure, and $\mathcal{J}$
is the collection of all products of the form $\prod_{i=1}^{d} [a_i, b_i)$,
with $0 \leq a_i < b_i \leq 1$. The star discrepancy of $P$ is defined by

\[ D^*(P) := \sup_{B \in \mathcal{J}^*} \left| \frac{\#(P \cap B)}{\#P} - \lambda^d(B) \right|, \]

where $\mathcal{J}^*$ is the collection of all products of the form
$\prod_{i=1}^{d} [0, b_i)$, with $0 \leq b_i < 1$.

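It can be shown that, for general $d \in \mathbb{N}$,
$D^*(P) \leq D(P) \leq 2^d D^*(P)$. In dimension $d = 1$, the star discrepancy
satisfies

\begin{align*}
D^*(x_1, \dots, x_n) &= \sup_{[0,b) \in \mathcal{J}^*} \left| \frac{\#(\{x_1, \dots, x_n\} \cap [0, b))}{n} - \lambda^1([0, b)) \right| \\
&= \sup_{0 \leq b < 1} \left| \frac{\#\{ i \mid x_i < b \}}{n} - b \right| \\
&= \| F_X - \mathrm{id} \|_{\infty},
\end{align*}

where $F_X \colon [0, 1] \to [0, 1]$ is the (left-continuous) cumulative
distribution function of the nodes $x_1, \dots, x_n$ defined by

\[ F_X(x) := \frac{\#\{ i \mid x_i < x \}}{n}. \]

Note that, when $x_1 < \dots < x_n$,

\[ F_X(x_i) = \frac{i - 1}{n} \quad \text{for } i = 1, \dots, n, \tag{9.15} \]

and so, for ordered nodes $x_i$,

\[ D^*(x_1, \dots, x_n) = \max_{1 \leq i \leq n} \max\left\{ \left| \frac{i - 1}{n} - x_i \right|, \left| \frac{i}{n} - x_i \right| \right\}. \tag{9.16} \]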

See Figure 9.4 for an illustration. 
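
Formula (9.16) gives a direct way to compute the one-dimensional star
discrepancy of a finite set of nodes. The sketch below implements it for
distinct nodes in $[0, 1)$; the function name `star_discrepancy_1d` and the
test nodes are hypothetical choices for illustration.

```python
def star_discrepancy_1d(nodes):
    """Star discrepancy of distinct nodes in [0, 1), computed via (9.16)."""
    xs = sorted(nodes)
    n = len(xs)
    return max(
        max(abs((i - 1) / n - x), abs(i / n - x))
        for i, x in enumerate(xs, start=1)
    )


# Midpoints of n equal subintervals attain the smallest possible value 1 / (2 n).
print(star_discrepancy_1d([0.125, 0.375, 0.625, 0.875]))
# Clustered nodes have a much larger star discrepancy.
print(star_discrepancy_1d([0.1, 0.2, 0.9]))
```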

Fig. 9.4: The star discrepancy of a finite set of nodes is the
$\| \cdot \|_{\infty}$ distance between the cumulative distribution function
of the nodes (in black) and that of the uniform measure on $[0, 1]$ (in
grey). The star discrepancy of the set of five nodes shown is determined by
the placement of node $x_4$.

Definition 9.21. Let $f \colon [0, 1]^d \to \mathbb{R}$. If
$J \subseteq [0, 1]^d$ is a sub-rectangle of $[0, 1]^d$, i.e. a $d$-fold
product of subintervals of $[0, 1]$, let $\Delta_J(f)$ be the sum of the
values of $f$ at the $2^d$ vertices of $J$, with alternating signs at
nearest-neighbour vertices. The Vitali variation of
$f \colon [0, 1]^d \to \mathbb{R}$ is defined to be

\[ V^{\mathrm{Vit}}(f) := \sup \left\{ \sum_{J \in \Pi} |\Delta_J(f)| \,\middle|\, \begin{array}{l} \Pi \text{ is a partition of } [0, 1]^d \text{ into finitely} \\ \text{many non-overlapping sub-rectangles} \end{array} \right\}. \]

For $1 \leq s \leq d$, the Hardy-Krause variation of $f$ is defined to be

\[ V^{\mathrm{HK}}(f) := \sum_{F} V^{\mathrm{Vit}}(f|_F), \]

where the sum runs over all faces $F$ of $[0, 1]^d$ having dimension at most
$s$.

In dimension $d = 1$, the Vitali and Hardy-Krause variations are equal, and
coincide with the usual notion of total variation of a function
$f \colon [0, 1] \to \mathbb{R}$:

\[ V(f) := \sup \left\{ \sum_{j=1}^{n} |f(x_j) - f(x_{j-1})| \,\middle|\, n \in \mathbb{N} \text{ and } 0 = x_0 < x_1 < \dots < x_n = 1 \right\}. \]

For quasi-Monte Carlo integration over an interval, Koksma’s inequality
provides an upper bound on the error of the quadrature rule in terms of the
variation of the integrand and the discrepancy of the nodal set:

Theorem 9.22 (Koksma). If $f \colon [0, 1] \to \mathbb{R}$ has bounded
variation, then, whenever $x_1, \dots, x_n \in [0, 1)$,

\[ \left| \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \int_0^1 f(x) \, \mathrm{d}x \right| \leq V(f) D^*(x_1, \dots, x_n). \]

This bound is sharp in the sense that, for every
$x_1, \dots, x_n \in [0, 1)$, there exists $f \colon [0, 1] \to \mathbb{R}$
with $V(f) = 1$ such that

\[ \left| \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \int_0^1 f(x) \, \mathrm{d}x \right| = D^*(x_1, \dots, x_n). \]

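Proof. A short proof of Koksma’s inequality can be given using the Stieltjes
integral:

\begin{align*}
\frac{1}{n} \sum_{i=1}^{n} f(x_i) - \int_0^1 f(x) \, \mathrm{d}x
&= \int_0^1 f(x) \, \mathrm{d}(F_X - \mathrm{id})(x) \\
&= - \int_0^1 (F_X(x) - x) \, \mathrm{d}f(x)
\end{align*}

by integration by parts, since the boundary terms $F_X(0) - \mathrm{id}(0)$
and $F_X(1) - \mathrm{id}(1)$ both vanish. The triangle inequality for the
Stieltjes integral then yields

\begin{align*}
\left| \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \int_0^1 f(x) \, \mathrm{d}x \right|
&= \left| \int_0^1 (F_X(x) - x) \, \mathrm{d}f(x) \right| \\
&\leq \| F_X - \mathrm{id} \|_{\infty} V(f) \\
&= D^*(x_1, \dots, x_n) V(f).
\end{align*}

To see that Koksma’s inequality is indeed sharp, fix nodes
$x_1, \dots, x_n \in [0, 1)$; for ease of notation, let $x_{n+1} := 1$. By
(9.16), there is a node $x_j$ such that either

\[ |F_X(x_j) - x_j| = D^*(x_1, \dots, x_n) \tag{9.17} \]

or

\[ |F_X(x_{j+1}) - x_j| = D^*(x_1, \dots, x_n). \]

If $|F_X(x_j) - x_j| = D^*(x_1, \dots, x_n)$, then define
$f \colon [0, 1] \to \mathbb{R}$ by

\[ f(x) := \mathbb{I}[x < x_j] := \begin{cases} 1, & \text{if } x < x_j, \\ 0, & \text{if } x \geq x_j. \end{cases} \]

Note that $f$ has a single jump discontinuity of height 1 at $x_j$, and is
constant either side of the discontinuity, so $f$ is of bounded variation
with $V(f) = 1$. Then

\begin{align*}
\left| \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \int_0^1 f(x) \, \mathrm{d}x \right|
&= \left| \frac{j - 1}{n} - x_j \right| \\
&= |F_X(x_j) - x_j| && \text{by (9.15)} \\
&= D^*(x_1, \dots, x_n) && \text{by (9.17)}
\end{align*}

as claimed. The other case, in which
$|F_X(x_{j+1}) - x_j| = D^*(x_1, \dots, x_n)$, is similar, with integrand
$f(x) := \mathbb{I}[x \leq x_j]$, for which the sample average is $j/n$ and
the integral is $x_j$.
□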
The multidimensional version of Koksma’s inequality is the Koksma-Hlawka
inequality:

Theorem 9.23 (Koksma-Hlawka). Let $f \colon [0, 1]^d \to \mathbb{R}$ have
bounded Hardy-Krause variation. Then, for any
$x_1, \dots, x_n \in [0, 1)^d$,

\[ \left| \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \int_{[0,1]^d} f(x) \, \mathrm{d}x \right| \leq V^{\mathrm{HK}}(f) D^*(x_1, \dots, x_n). \]

This bound is sharp in the sense that, for every
$x_1, \dots, x_n \in [0, 1)^d$ and every $\varepsilon > 0$, there exists
$f \colon [0, 1]^d \to \mathbb{R}$ with $V^{\mathrm{HK}}(f) = 1$ such that

\[ \left| \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \int_{[0,1]^d} f(x) \, \mathrm{d}x \right| > D^*(x_1, \dots, x_n) - \varepsilon. \]


There are several well-known sequences, such as those of Van der Corput,
Halton, and Sobol′, with star discrepancy
$D^*(x_1, \dots, x_n) \leq C (\log n)^d / n$, which is conjectured to be the
best possible star discrepancy. Hence, for quasi-Monte Carlo integration
using such sequences,

\[ \left| \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \int_{[0,1]^d} f(x) \, \mathrm{d}x \right| \leq \frac{C \, V^{\mathrm{HK}}(f) (\log n)^d}{n}. \]

It is essentially the number-theoretic properties of these sequences that ensure 
their quasi-randomness, equidistributedness and low discrepancy. 

Definition 9.24. The Van der Corput sequence in $(0, 1)$ with base
$b \in \mathbb{N}$, $b > 1$, is defined by

\[ x^{(b)}_n := \sum_{k=0}^{K} d_k(n) b^{-k-1}, \]

where $n = \sum_{k=0}^{K} d_k(n) b^k$ is the unique representation of $n$ in
base $b$ with $0 \leq d_k(n) < b$. The Halton sequence in $(0, 1)^d$ with
bases $b_1, \dots, b_d \in \mathbb{N}$, each greater than 1, is defined in
terms of Van der Corput sequences by

\[ x_n := \bigl( x^{(b_1)}_n, \dots, x^{(b_d)}_n \bigr). \]


In practice, to assure that the discrepancy of a Halton sequence is low, 
the generating bases are chosen to be pairwise coprime. See Figure 9.5 
(a-c) for an illustration of the Halton sequence generated by the prime (and 
hence coprime) bases 2 and 3, and Figure 9.5(d) for an illustration of why 
the coprimality assumption is necessary if the Halton sequence is to be even 
approximately uniformly distributed. 
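
Definition 9.24 amounts to reversing the base-$b$ digits of $n$ about the
radix point. A minimal sketch follows; the function names `van_der_corput`
and `halton` are illustrative, and the bases $(2, 3)$ are chosen to match the
prime bases mentioned above.

```python
def van_der_corput(n, base):
    """n-th term of the Van der Corput sequence: sum of d_k(n) * base**(-k-1)."""
    x = 0.0
    denom = 1.0
    while n > 0:
        n, digit = divmod(n, base)  # peel off the least significant digit d_k(n)
        denom *= base
        x += digit / denom
    return x


def halton(n, bases):
    """n-th point of the Halton sequence with the given bases."""
    return tuple(van_der_corput(n, b) for b in bases)


for n in range(1, 6):
    print(n, halton(n, (2, 3)))
```

New points can be appended at any time without discarding earlier ones,
which is the extensibility property discussed at the start of this section.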

Fig. 9.5: The first $N$ points in the two-dimensional Halton sequence with
base $b \in \mathbb{N}^2$, for various $N$, with shading as in Figure 9.3.
Subfigure (d) illustrates the strong correlation structure when $b$ has
non-coprime components.

Quasi-Monte Carlo quadrature nodes for uniform measure on $[0, 1]$ can
be transformed into quasi-Monte Carlo quadrature nodes for other measures
in much the same way as for Monte Carlo nodes, e.g. by re-weighting or
by coordinate transformations, as in (9.11). Thus, for example, quasi-Monte
Carlo sampling of $\mathcal{N}(m, C)$ on $\mathbb{R}^d$ can be performed by
taking $(x_n)_{n \in \mathbb{N}}$ to be a low-discrepancy sequence in the
$d$-dimensional unit cube and setting

\[ \tilde{x}_n := m + C^{1/2} \bigl( \Phi^{-1}(x_{n,1}), \dots, \Phi^{-1}(x_{n,d}) \bigr), \]

where $\Phi \colon \mathbb{R} \to [0, 1]$ denotes the cumulative distribution
function of the standard normal distribution,

\[ \Phi(x) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp(-t^2/2) \, \mathrm{d}t = \frac{1}{2} \left( 1 + \operatorname{erf}\left( \frac{x}{\sqrt{2}} \right) \right), \]

and $\Phi^{-1}$ is its inverse (the probit function). This procedure is
illustrated in Figure 9.6 for the Gaussian measure

\[ \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 5/4 & 1/4 \\ 1/4 & 1 \end{bmatrix} \right) \]

on $\mathbb{R}^2$.

Fig. 9.6: The Halton sequence from Figure 9.5(c), transformed to be a
sequence of approximate draws from the Gaussian distribution
$\mathcal{N}(m, C)$ on $\mathbb{R}^2$ with $m = 0$ and
$C = \begin{bmatrix} 5/4 & 1/4 \\ 1/4 & 1 \end{bmatrix}$, with shading as in
Figure 9.3.
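
The transformation of a Halton sequence into approximate Gaussian draws can
be sketched with the standard library alone. Two substitutions are made and
should be read as assumptions: a Cholesky factor $L$ with $L L^{\top} = C$ is
used in place of the symmetric square root $C^{1/2}$ (either choice maps
$\mathcal{N}(0, I)$ to $\mathcal{N}(0, C)$), and the probit function is taken
from `statistics.NormalDist`. The helper names are hypothetical.

```python
import math
from statistics import NormalDist


def van_der_corput(n, base):
    """n-th term of the Van der Corput sequence with the given base."""
    x, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        x += digit / denom
    return x


def cholesky_2x2(c):
    """Lower-triangular L with L L^T = c for a 2x2 symmetric positive-definite c."""
    l11 = math.sqrt(c[0][0])
    l21 = c[1][0] / l11
    l22 = math.sqrt(c[1][1] - l21 * l21)
    return l11, l21, l22


def gaussian_qmc_points(n_points, mean, cov, bases=(2, 3)):
    """Map the first n_points Halton points to approximate draws from N(mean, cov)."""
    probit = NormalDist().inv_cdf
    l11, l21, l22 = cholesky_2x2(cov)
    points = []
    for n in range(1, n_points + 1):  # start at n = 1 so that all coordinates lie in (0, 1)
        z1 = probit(van_der_corput(n, bases[0]))
        z2 = probit(van_der_corput(n, bases[1]))
        points.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    return points


points = gaussian_qmc_points(2000, (0.0, 0.0), ((1.25, 0.25), (0.25, 1.0)))
print(points[:3])
```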
9.8 Exercises

Exercise 9.1. A quadrature rule $Q$ with weights of both signs has the
undesirable property that an integrand $f$ can take strictly positive values
everywhere in the integration domain, yet have $Q(f) = 0$. Explicitly,
suppose that $Q$ has $m + n$ nodes
$x^+_1, \dots, x^+_m, x^-_1, \dots, x^-_n \in [a, b]$ with corresponding
weights $w^+_1, \dots, w^+_m > 0$ and $w^-_1, \dots, w^-_n < 0$, so that

\[ Q(f) = \sum_{i=1}^{m} w^+_i f(x^+_i) + \sum_{j=1}^{n} w^-_j f(x^-_j). \]

(a) Consider first the case $m = n = 1$. Construct a smooth, strictly positive
function $f \colon [a, b] \to \mathbb{R}$ with $f(x^+_1) = -w^-_1$ and
$f(x^-_1) = w^+_1$. Show that this $f$ has $Q(f) = 0$.

(b) Generalize the previous part to general $m, n \geq 1$ to find a smooth,
strictly positive function $f \colon [a, b] \to \mathbb{R}$ with $Q(f) = 0$.

(c) Further generalize this result to multivariate quadrature for the
approximate integration of $f \colon \mathcal{X} \to \mathbb{R}$ with respect
to $\mu$, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mu$ is a
non-negative measure on $\mathcal{X}$.

Exercise 9.2 (Takahasi-Mori (tanh-sinh) Quadrature (Takahasi and Mori,
1973/74)). Consider a definite integral over $[-1, 1]$ of the form
$\int_{-1}^{1} f(x) \, \mathrm{d}x$. Employ a change of variables
$x = \varphi(t) := \tanh(\frac{\pi}{2} \sinh(t))$ to convert this to an
integral over the real line. Let $h > 0$ and $K \in \mathbb{N}$, and
approximate this integral over $\mathbb{R}$ using $2K + 1$ points equally
spaced from $-Kh$ to $Kh$ to derive a quadrature rule

\[ \int_{-1}^{1} f(x) \, \mathrm{d}x \approx Q_{h,K}(f) := \sum_{k=-K}^{K} w_k f(x_k), \]

where

\[ x_k := \tanh\bigl( \tfrac{\pi}{2} \sinh(kh) \bigr) \quad \text{and} \quad w_k := \frac{\frac{\pi}{2} h \cosh(kh)}{\cosh^2\bigl( \frac{\pi}{2} \sinh(kh) \bigr)}. \]

How are these nodes distributed in $[-1, 1]$? Why is excluding the nodes
$x_k$ with $|k| > K$ a reasonable approximation?

Exercise 9.3. Following Remark 9.15, find the interpolation basis associated
with the Smolyak-Newton-Cotes quadrature rule in $d \in \mathbb{N}$ variables
at level $\ell \in \mathbb{N}$.


Exercise 9.4. Implement the Metropolis-Hastings Markov chain Monte 
Carlo method and use it to sample the uniform measure on your favourite 
open subset $E \subseteq \mathbb{R}^d$, ideally a non-convex one, as in Example 9.19. Do the
same for a density that is not an indicator function, but has non-convex 
superlevel sets, e.g. a bimodal convex combination of two Gaussian measures 
with distinct means. 
Exercise 9.5. Let $\mu$ be a probability measure on $\mathbb{R}$ with known
probability density function $\rho$ and cumulative distribution function $F$
with known inverse $F^{-1}$. Sample $\mu$ using

(a) inversion of $F$, as in (9.11);

(b) the Metropolis-Hastings MCMC method, as in Exercise 9.4.

Use histograms/empirical cumulative distribution functions to compare the
closeness of the sample distributions to $\mu$.

Exercise 9.6. Using (9.16), generate Van der Corput sequences and produce
numerical evidence for the assertion that they have star discrepancy at most
$C \frac{\log n}{n}$.

Exercise 9.7. Consider, for $k \in \mathbb{N}$, the function
$f_k \colon [0, 1]^2 \to \mathbb{R}$ defined by

\[ f_k(x, y) := \cos(2 k \pi x) + \bigl( y - \tfrac{1}{2} \bigr), \]

for which clearly $\int_0^1 \int_0^1 f_k(x, y) \, \mathrm{d}x \mathrm{d}y = 0$.
Integrate this function approximately using

(a) Gaussian quadrature;

(b) Clenshaw-Curtis quadrature;

(c) Monte Carlo; and

(d) a Halton sequence.

Compare accuracy of the results that you obtain as a function of $N$, the
number of sample points used, and as a function of $k$.


Chapter 10 

Sensitivity Analysis and Model 
Reduction 


Le doute n’est pas un état bien agréable, mais
l’assurance est un état ridicule.


Voltaire 


The topic of this chapter is sensitivity analysis, which may be broadly
understood as understanding how $f(x_1, \dots, x_n)$ depends upon variations
not only in the $x_i$ individually, but also combined or correlated effects
among the $x_i$. There are two broad classes of sensitivity analyses: local
sensitivity analyses study the sensitivity of $f$ to variations in its inputs
at or near a particular base point, as exemplified by the calculation of
derivatives; global sensitivity analyses study the ‘average’ sensitivity of
$f$ to variations of its inputs across the domain of definition of $f$, as
exemplified by the McDiarmid diameters and Sobol′ indices introduced in
Sections 10.3 and 10.4 respectively.

A closely related topic is that of model order reduction, in which it is
desired to find a new function $\hat{f}$, a function of many fewer inputs than
$f$, that can serve as a good approximation to $f$. Practical problems from
engineering and the sciences can easily have models with millions or billions
of inputs (degrees of freedom). Thorough exploration of such high-dimensional
spaces, e.g. for the purposes of parameter optimization or a Bayesian
inversion, is all but impossible; in such situations, it is essential to be
able to resort to some kind of proxy $\hat{f}$ for $f$ in order to obtain
results of any kind, even though their accuracy will be controlled by the
accuracy of the approximation $\hat{f} \approx f$.


(c) Springer International Publishing Switzerland 2015 

T.J. Sullivan, Introduction to Uncertainty Quantification , Texts 

in Applied Mathematics 63, DOI 10.1007/978-3-319-23395-6-10 
10.1 Model Reduction for Linear Models

Suppose that the model mapping inputs $x \in \mathbb{C}^n$ to outputs
$y = f(x) \in \mathbb{C}^m$ is actually a linear map, and so can be
represented by a matrix $A \in \mathbb{C}^{m \times n}$. There is essentially
only one method for the dimensional reduction of such linear models, the
singular value decomposition (SVD).

Theorem 10.1 (Singular value decomposition). Every matrix
$A \in \mathbb{C}^{m \times n}$ can be factorized as $A = U \Sigma V^*$, where
$U \in \mathbb{C}^{m \times m}$ is unitary (i.e. $U^* U = U U^* = I$),
$V \in \mathbb{C}^{n \times n}$ is unitary, and
$\Sigma \in \mathbb{R}_{\geq 0}^{m \times n}$ is diagonal. Furthermore, if $A$
is real, then $U$ and $V$ are also real.

Remark 10.2. The existence of an SVD-like decomposition for an operator 
A between Hilbert spaces is essentially the definition of A being a compact 
operator (cf. Definition 2.48). 

The columns of $U$ are called the left singular vectors of $A$; the columns
of $V$ are called the right singular vectors of $A$; and the diagonal entries of
$\Sigma$ are called the singular values of $A$. While the singular values are unique,
the singular vectors may fail to be. By convention, the singular values and
corresponding singular vectors are ordered so that the singular values form a
decreasing sequence

\[ \sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_{\min\{m,n\}} \geq 0. \]

Thus, the SVD is a decomposition of $A$ into a sum of rank-1 operators:

\[ A = U \Sigma V^* = \sum_{j=1}^{\min\{m,n\}} \sigma_j u_j \otimes v_j = \sum_{j=1}^{\min\{m,n\}} \sigma_j u_j \langle v_j, \cdot \rangle. \]

The singular values and singular vectors are closely related to the eigenpairs
of the self-adjoint and positive semi-definite matrices $A^* A$ and $A A^*$:

(a) If $m < n$, then the eigenvalues of $A^* A$ are $\sigma_1^2, \dots, \sigma_m^2$ and $n - m$ zeros,
and the eigenvalues of $A A^*$ are $\sigma_1^2, \dots, \sigma_m^2$.

(b) If $m = n$, then the eigenvalues of $A^* A$ and of $A A^*$ are $\sigma_1^2, \dots, \sigma_n^2$.

(c) If $m > n$, then the eigenvalues of $A^* A$ are $\sigma_1^2, \dots, \sigma_n^2$ and the eigenvalues
of $A A^*$ are $\sigma_1^2, \dots, \sigma_n^2$ and $m - n$ zeros.

In all cases, the eigenvectors of $A^* A$ are the columns of $V$, i.e. the right
singular vectors of $A$, and the eigenvectors of $A A^*$ are the columns of $U$, i.e.
the left singular vectors of $A$.
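
The relationship between singular values and eigenvalues can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the random test matrix, its dimensions and all variable names are illustrative choices and not part of the text. It covers case (a), $m < n$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5  # case (a): m < n
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

# Full SVD A = U Sigma V^*; NumPy returns V^* (here Vh) and the singular
# values s in decreasing order.
U, s, Vh = np.linalg.svd(A, full_matrices=True)
k = min(m, n)
Sigma = np.zeros((m, n))
Sigma[:k, :k] = np.diag(s)
reconstruction_error = np.linalg.norm(A - U @ Sigma @ Vh)

# Eigenvalues of the Hermitian matrices A^* A (n x n) and A A^* (m x m),
# sorted into decreasing order for comparison with s**2.
eig_AhA = np.sort(np.linalg.eigvalsh(A.conj().T @ A))[::-1]
eig_AAh = np.sort(np.linalg.eigvalsh(A @ A.conj().T))[::-1]
```

Here `eig_AAh` should agree with `s**2`, while `eig_AhA` should consist of `s**2` followed by $n - m$ values that are zero up to rounding error.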

The appeal of the SVD is that it can be calculated in a numerically stable 
fashion (e.g. by bidiagonalization via Householder reflections, followed by a 
variant of the QR algorithm for eigenvalues), and that it provides optimal
low-rank approximation of linear operators in a sense made precise by the 
next two results: 

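Theorem 10.3 (Courant–Fischer minimax theorem). For $A \in \mathbb{C}^{m \times n}$ and a
subspace $E \subseteq \mathbb{C}^n$, let

\[ \bigl\| A|_E \bigr\|_2 := \sup_{x \in E \setminus \{0\}} \frac{\|Ax\|_2}{\|x\|_2} \equiv \sup_{x \in E \setminus \{0\}} \frac{\langle x, A^* A x \rangle^{1/2}}{\|x\|_2} \]

be the operator 2-norm of $A$ restricted to $E$. Then the singular values of $A$
satisfy, for $k = 1, \dots, \min\{m, n\}$,

\[ \sigma_k = \inf_{\substack{\text{subspaces } E \text{ s.t.} \\ \operatorname{codim} E = k - 1}} \bigl\| A|_E \bigr\|_2 = \inf_{\substack{\text{subspaces } E \text{ s.t.} \\ \operatorname{codim} E \leq k - 1}} \bigl\| A|_E \bigr\|_2 . \]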


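Proof. Let $A$ have SVD $A = U \Sigma V^*$, and let $v_1, \dots, v_n$ be the columns of
$V$, i.e. the eigenvectors of $A^* A$. Then, for any $x \in \mathbb{C}^n$,

\[ x = \sum_{j=1}^{n} \langle v_j, x \rangle v_j, \qquad A^* A x = \sum_{j=1}^{n} \sigma_j^2 \langle v_j, x \rangle v_j, \qquad \langle x, A^* A x \rangle = \sum_{j=1}^{n} \sigma_j^2 |\langle x, v_j \rangle|^2 . \]

Let $E \subseteq \mathbb{C}^n$ have $\operatorname{codim} E \leq k - 1$. Then the $k$-dimensional subspace spanned
by $v_1, \dots, v_k$ has some $x \neq 0$ in common with $E$, and so

\[ \langle x, A^* A x \rangle = \sum_{j=1}^{k} \sigma_j^2 |\langle x, v_j \rangle|^2 \geq \sigma_k^2 \sum_{j=1}^{k} |\langle x, v_j \rangle|^2 = \sigma_k^2 \|x\|_2^2 . \]

Hence, $\sigma_k \leq \bigl\| A|_E \bigr\|_2$ for any $E$ with $\operatorname{codim} E \leq k - 1$.

It remains only to find some $E$ with $\operatorname{codim} E = k - 1$ for which $\sigma_k \geq \bigl\| A|_E \bigr\|_2$.
Take $E := \operatorname{span}\{v_k, \dots, v_n\}$. Then, for any $x \in E$,

\[ \langle x, A^* A x \rangle = \sum_{j=k}^{n} \sigma_j^2 |\langle x, v_j \rangle|^2 \leq \sigma_k^2 \sum_{j=k}^{n} |\langle x, v_j \rangle|^2 = \sigma_k^2 \|x\|_2^2 , \]

which completes the proof. □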


Theorem 10.4 (Eckart–Young low-rank approximation theorem). Given
$A \in \mathbb{C}^{m \times n}$, let $A_k \in \mathbb{C}^{m \times n}$ be the matrix formed from the first $k$ singular
vectors and singular values of $A$, i.e.

\[ A_k := \sum_{j=1}^{k} \sigma_j u_j \otimes v_j . \tag{10.1} \]

Then

\[ \sigma_{k+1} = \|A - A_k\|_2 = \inf_{\substack{X \in \mathbb{C}^{m \times n} \\ \operatorname{rank} X \leq k}} \|A - X\|_2 . \]

Hence, as measured by the operator 2-norm,

(a) $A_k$ is the best approximation to $A$ of rank at most $k$; and

(b) if $A \in \mathbb{C}^{n \times n}$, then $A$ is invertible if and only if $\sigma_n > 0$, and $\sigma_n$ is the
distance of $A$ from the set of singular matrices.


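Proof. Let $\mathcal{M}_k$ denote the set of matrices in $\mathbb{C}^{m \times n}$ with rank $\leq k$, and let
$X \in \mathcal{M}_k$. Since $\operatorname{rank} X + \dim \ker X = n$, it follows that $\operatorname{codim} \ker X \leq k$. By
Theorem 10.3,

\[ \sigma_{k+1} \leq \sup_{x \in E \setminus \{0\}} \frac{\|Ax\|_2}{\|x\|_2} \quad \text{for every subspace } E \subseteq \mathbb{C}^n \text{ with } \operatorname{codim} E \leq k . \]

Hence,

\[ \sigma_{k+1} \leq \sup_{x \in \ker X \setminus \{0\}} \frac{\|Ax\|_2}{\|x\|_2} = \sup_{x \in \ker X \setminus \{0\}} \frac{\|(A - X)x\|_2}{\|x\|_2} \leq \|A - X\|_2 . \]

Hence $\sigma_{k+1} \leq \inf_{X \in \mathcal{M}_k} \|A - X\|_2$.

Now consider $A_k$ as given by (10.1), which certainly has $\operatorname{rank} A_k \leq k$.
Now,

\[ A - A_k = \sum_{j=k+1}^{r} \sigma_j u_j \otimes v_j , \]

where $r := \operatorname{rank} A$. Write $x \in \mathbb{C}^n$ as $x = \sum_{j=1}^{n} \langle v_j, x \rangle v_j$. Then

\[ (A - A_k) x = \sum_{j=k+1}^{r} \sigma_j u_j \langle v_j, x \rangle , \]

and so

\[ \|(A - A_k) x\|_2^2 = \sum_{j=k+1}^{r} \sigma_j^2 |\langle v_j, x \rangle|^2 \leq \sigma_{k+1}^2 \sum_{j=k+1}^{r} |\langle v_j, x \rangle|^2 \leq \sigma_{k+1}^2 \|x\|_2^2 . \]

Hence, $\|A - A_k\|_2 \leq \sigma_{k+1}$. □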
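
The Eckart–Young theorem can be illustrated numerically. The sketch below assumes NumPy; the helper name `truncated_svd`, the random matrix and the random rank-$k$ competitor `X` are hypothetical choices made only for illustration.

```python
import numpy as np

def truncated_svd(A, k):
    """Return A_k, the sum of the first k terms sigma_j u_j v_j^* of an SVD of A."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    # Scale the first k columns of U by the singular values, then multiply by V^*.
    return (U[:, :k] * s[:k]) @ Vh[:k, :]

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)  # singular values, decreasing

k = 2
A_k = truncated_svd(A, k)
err_opt = np.linalg.norm(A - A_k, ord=2)  # operator 2-norm of A - A_k

# An arbitrary competitor of rank at most k, for comparison only.
X = rng.standard_normal((6, k)) @ rng.standard_normal((k, 4))
err_rand = np.linalg.norm(A - X, ord=2)
```

With zero-based indexing, `s[k]` is $\sigma_{k+1}$, so `err_opt` should agree with `s[k]` up to rounding, and `err_rand` should be no smaller.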

See Chapter 11 for an application of the SVD to the analysis of sample
data from random variables, a discrete variant of the Karhunen–Loève
expansion, known as principal component analysis (PCA). Simply put, when
$A$ is a matrix whose columns are independent samples from some stochastic
process (random vector), the SVD of $A$ is the ideal way to fit a linear structure
to those data points. One may consider nonlinear fitting and dimensionality
reduction methods in the same way, and this is known as manifold learning.
There are many nonlinear generalizations of the SVD/PCA: see the bibliography
for some references.


10.2 Derivatives 


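One way to understand the dependence of $f(x_1, \dots, x_n)$ upon $x_1, \dots, x_n$ near
some nominal point $\bar{x} = (\bar{x}_1, \dots, \bar{x}_n)$ is to estimate the partial derivatives of
$f$ at $\bar{x}$, i.e. to approximate

\[ \frac{\partial f}{\partial x_i}(\bar{x}) := \lim_{h \to 0} \frac{f(\bar{x}_1, \dots, \bar{x}_i + h, \dots, \bar{x}_n) - f(\bar{x})}{h} . \]

For example, for a function $f$ of a single real variable $x$, and with a fixed step
size $h > 0$, the derivative of $f$ at $\bar{x}$ may be approximated using the forward
difference

\[ \frac{\mathrm{d}f}{\mathrm{d}x}(\bar{x}) \approx \frac{f(\bar{x} + h) - f(\bar{x})}{h} \]

or the backward difference

\[ \frac{\mathrm{d}f}{\mathrm{d}x}(\bar{x}) \approx \frac{f(\bar{x}) - f(\bar{x} - h)}{h} . \]

Similarly, the second derivative of $f$ might be approximated using the second
order central difference

\[ \frac{\mathrm{d}^2 f}{\mathrm{d}x^2}(\bar{x}) \approx \frac{f(\bar{x} + h) - 2 f(\bar{x}) + f(\bar{x} - h)}{h^2} . \]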

Ultimately, approximating the derivatives of $f$ in this way is implicitly a
polynomial approximation: polynomials coincide with their Taylor expansions,
their derivatives can be computed exactly, and we make the approximation
that $f \approx p \implies f' \approx p'$. Alternatively, we can construct a randomized
estimate of the derivative of $f$ at $\bar{x}$ by random sampling of $x$ near $\bar{x}$ (i.e. $x$
not necessarily of the form $x = \bar{x} + h e_i$), as in the simultaneous perturbation
stochastic approximation (SPSA) method of Spall (1992).
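
The finite difference formulas above translate directly into code. The following is a minimal sketch using only the Python standard library; the function names, the test function `math.sin`, the base point and the step size are illustrative assumptions.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) by (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

def backward_difference(f, x, h):
    """Approximate f'(x) by (f(x) - f(x - h)) / h."""
    return (f(x) - f(x - h)) / h

def second_central_difference(f, x, h):
    """Approximate f''(x) by (f(x + h) - 2 f(x) + f(x - h)) / h^2."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

x_bar = 1.0
h = 1e-4
fwd = forward_difference(math.sin, x_bar, h)        # close to cos(1)
bwd = backward_difference(math.sin, x_bar, h)       # close to cos(1)
second = second_central_difference(math.sin, x_bar, h)  # close to -sin(1)
```

The step size is a trade-off: a smaller `h` reduces the truncation error but amplifies floating-point cancellation error, especially in the second difference, where the numerator is divided by `h**2`.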

An alternative paradigm for differentiation is based on the observation that
many numerical operations on a computer are in fact polynomial operations,
so they can be differentiated accurately using the algebraic properties of
differential calculus, rather than the analytical definitions of those objects.
A simple algebraic structure that encodes first derivatives is the concept of
dual numbers, the abstract algebraic definition of which is as follows:

Definition 10.5. The dual numbers $\mathbb{R}_\epsilon$ are defined to be the quotient of the
polynomial ring $\mathbb{R}[x]$ by the ideal generated by the monomial $x^2$.

In plain terms, $\mathbb{R}_\epsilon = \{x_0 + x_1 \epsilon \mid x_0, x_1 \in \mathbb{R}\}$, where $\epsilon \neq 0$ has the property
that $\epsilon^2 = 0$ ($\epsilon$ is said to be nilpotent). Addition and subtraction of dual
numbers is handled componentwise; multiplication of dual numbers is handled
similarly to multiplication of complex numbers, except that the relation
$\epsilon^2 = 0$ is used in place of the relation $i^2 = -1$; however, there are some
additional subtleties in division, which is only well defined when the real part of
the denominator is non-zero, and is otherwise multivalued or even undefined.
In summary:


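\[ (x_0 + x_1 \epsilon) + (y_0 + y_1 \epsilon) = (x_0 + y_0) + (x_1 + y_1) \epsilon , \]
\[ (x_0 + x_1 \epsilon) - (y_0 + y_1 \epsilon) = (x_0 - y_0) + (x_1 - y_1) \epsilon , \]
\[ (x_0 + x_1 \epsilon)(y_0 + y_1 \epsilon) = x_0 y_0 + (x_0 y_1 + x_1 y_0) \epsilon , \]
\[ \frac{x_0 + x_1 \epsilon}{y_0 + y_1 \epsilon} = \begin{cases} \dfrac{x_0}{y_0} + \dfrac{y_0 x_1 - x_0 y_1}{y_0^2} \epsilon , & \text{if } y_0 \neq 0, \\ \dfrac{x_1}{y_1} + z \epsilon , & \text{for any } z \in \mathbb{R} \text{ if } x_0 = y_0 = 0, \\ \text{undefined}, & \text{if } y_0 = 0 \text{ and } x_0 \neq 0. \end{cases} \]

A helpful representation of $\mathbb{R}_\epsilon$ in terms of $2 \times 2$ real matrices is given by

\[ x_0 + x_1 \epsilon \longleftrightarrow \begin{bmatrix} x_0 & x_1 \\ 0 & x_0 \end{bmatrix} \quad \text{so that} \quad \epsilon \longleftrightarrow \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} . \]

One can easily check that the algebraic rules for addition, multiplication, etc.
in $\mathbb{R}_\epsilon$ correspond exactly to the usual rules for addition, multiplication, etc.
of $2 \times 2$ matrices.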


Automatic Differentiation. A useful application of dual numbers is automatic
differentiation, which is a form of exact differentiation that arises as
a side-effect of the algebraic properties of the nilpotent element $\epsilon$, which
behaves rather like an infinitesimal in non-standard analysis. Given the
algebraic properties of the dual numbers, any polynomial $p(x) := p_0 + p_1 x +
\dots + p_n x^n \in \mathbb{R}[x]_{\leq n}$, thought of as a function $p \colon \mathbb{R} \to \mathbb{R}$, can be extended to
a function $p \colon \mathbb{R}_\epsilon \to \mathbb{R}_\epsilon$. Then, for any $x_0 + x_1 \epsilon \in \mathbb{R}_\epsilon$,

\[ p(x_0 + x_1 \epsilon) = \sum_{k=0}^{n} p_k (x_0 + x_1 \epsilon)^k \]
\[ = \Bigl( \sum_{k=0}^{n} p_k x_0^k \Bigr) + \bigl( p_1 x_1 \epsilon + 2 p_2 x_0 x_1 \epsilon + \dots + n p_n x_0^{n-1} x_1 \epsilon \bigr) \]
\[ = p(x_0) + p'(x_0) x_1 \epsilon . \]

Thus the derivative of a real polynomial at $x$ is exactly the coefficient of $\epsilon$ in its
dual-number extension $p(x + \epsilon)$. Indeed, by considering Taylor series, it follows
that the same result holds true for any analytic function (see Exercise 10.1).
Since many numerical functions on a computer are evaluations of polynomials
or power series, the use of dual numbers offers accurate symbolic differentiation
of such functions, once those functions have been extended to accept
dual number arguments and return dual number values. Implementation of
dual number arithmetic is relatively straightforward for many common
programming languages such as C/C++, Python, and so on; however, technical
problems can arise when interfacing with legacy codes that cannot be modified
to operate with dual numbers.
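
As an illustration of how little is needed, here is a minimal sketch of dual number arithmetic in Python using operator overloading. The class name `Dual`, its field names and the helper `derivative` are hypothetical, and only the four arithmetic operations summarized above are implemented; division is restricted to the well-defined case of a non-zero real part.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    """A dual number re + eps * epsilon, where epsilon**2 = 0."""
    re: float
    eps: float = 0.0

    @staticmethod
    def lift(other):
        # Embed ordinary real numbers as duals with zero epsilon part.
        return other if isinstance(other, Dual) else Dual(float(other), 0.0)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.re + other.re, self.eps + other.eps)
    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.re - other.re, self.eps - other.eps)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.re * other.re,
                    self.re * other.eps + self.eps * other.re)
    __rmul__ = __mul__

    def __truediv__(self, other):
        other = Dual.lift(other)
        if other.re == 0.0:
            raise ZeroDivisionError("dual division needs a non-zero real part")
        return Dual(self.re / other.re,
                    (other.re * self.eps - self.re * other.eps) / other.re**2)

def derivative(func, x):
    """Return the coefficient of epsilon in func(x + epsilon)."""
    return func(Dual(x, 1.0)).eps

def p(x):
    return 3 * x * x * x - 2 * x + 1   # p'(x) = 9 x^2 - 2

def g(x):
    return (x * x + 1) / (x - 3)       # g'(x) = (x^2 - 6x - 1) / (x - 3)^2
```

Because `p` and `g` are written using only overloaded operations, the same source code evaluates them on floats and on `Dual` arguments; for example, `derivative(p, 2.0)` evaluates $p'(2) = 34$.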

Remark 10.6. (a) An attractive feature of automatic differentiation is that
complicated compositions of functions can be differentiated exactly using
the chain rule

\[ (f \circ g)'(x) = f'(g(x)) \, g'(x) \]

and automatic differentiation of the functions being composed.

(b) For higher-order derivatives, instead of working in a number system for
which $\epsilon^2 = 0$, one works in a system in which $\epsilon^3$ or some other higher
power of $\epsilon$ is zero. For example, to obtain automatic second derivatives,
consider

\[ \mathbb{R}_{\epsilon, \epsilon^2} = \{ x_0 + x_1 \epsilon + x_2 \epsilon^2 \mid x_0, x_1, x_2 \in \mathbb{R} \} \]

with $\epsilon^3 = 0$. The derivative at $x$ of a polynomial $p$ is again the coefficient
of $\epsilon$ in $p(x + \epsilon)$, and the second derivative is twice (i.e. $2!$ times) the
coefficient of $\epsilon^2$ in $p(x + \epsilon)$.

(c) Analogous dual systems can be constructed for any commutative ring $R$,
by defining the dual ring to be the quotient ring $R[x]/(x^2)$, a good
example being the ring of square matrices over some field. The image of
$x$ under the quotient map then has square equal to zero and plays the
role of $\epsilon$ in the above discussion.

(d) Automatic differentiation of vector-valued functions of vector arguments
can be accomplished using a nilpotent vector $\epsilon = (\epsilon_1, \dots, \epsilon_n)$ with the
property that $\epsilon_i \epsilon_j = 0$ for all $i, j \in \{1, \dots, n\}$; see Exercise 10.3.
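
Part (b) of the remark can be sketched in a few lines. The code below represents an element $a_0 + a_1 \epsilon + a_2 \epsilon^2$ with $\epsilon^3 = 0$ as a plain tuple and evaluates $p(x + \epsilon)$ by Horner's rule; the function names and the example polynomial are illustrative assumptions.

```python
def jet2_mul(a, b):
    """Multiply a0 + a1 e + a2 e^2 by b0 + b1 e + b2 e^2, using e^3 = 0."""
    a0, a1, a2 = a
    b0, b1, b2 = b
    return (a0 * b0,
            a0 * b1 + a1 * b0,
            a0 * b2 + a1 * b1 + a2 * b0)

def jet2_polyval(coeffs, x):
    """Evaluate p(x + e) by Horner's rule, where coeffs = [p_0, p_1, ..., p_n]."""
    arg = (x, 1.0, 0.0)            # the element x + e
    result = (0.0, 0.0, 0.0)
    for c in reversed(coeffs):
        r0, r1, r2 = jet2_mul(result, arg)
        result = (r0 + c, r1, r2)  # adding a real constant affects only the real part
    return result

# p(x) = 1 - 2x + 3x^3, so p'(x) = -2 + 9x^2 and p''(x) = 18x.
value, c1, c2 = jet2_polyval([1.0, -2.0, 0.0, 3.0], 2.0)
first_derivative = c1          # coefficient of e
second_derivative = 2.0 * c2   # 2! times the coefficient of e^2
```

At $x = 2$ this gives $p(2) = 21$, $p'(2) = 34$ and $p''(2) = 36$.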


The Adjoint Method. A common technique for understanding the impact
of uncertain or otherwise variable parameters on a system is the so-called
adjoint method, which is in fact a cunning application of the implicit function
theorem (IFT) from multivariate calculus:

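Theorem 10.7 (Implicit function theorem). Let $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ be Banach
spaces, let $W \subseteq \mathcal{X} \times \mathcal{Y}$ be open, and let $f \in C^k(W; \mathcal{Z})$ for some $k \geq 1$.
Suppose that, at $(\bar{x}, \bar{y}) \in W$, the partial Fréchet derivative $\frac{\partial f}{\partial y}(\bar{x}, \bar{y}) \colon \mathcal{Y} \to \mathcal{Z}$
is an invertible bounded linear map. Then there exist open sets $U \subseteq \mathcal{X}$ about
$\bar{x}$, $V \subseteq \mathcal{Y}$ about $\bar{y}$, with $U \times V \subseteq W$, and a unique $\varphi \in C^k(U; V)$ such that

\[ \{ (x, y) \in U \times V \mid f(x, y) = f(\bar{x}, \bar{y}) \} = \{ (x, y) \in U \times V \mid y = \varphi(x) \} , \]

i.e. the contour of $f$ through $(\bar{x}, \bar{y})$ is locally the graph of $\varphi$. Furthermore, $U$
can be chosen so that $\frac{\partial f}{\partial y}(x, \varphi(x))$ is boundedly invertible for all $x \in U$, and
the Fréchet derivative $\frac{\mathrm{d}\varphi}{\mathrm{d}x}(x) \colon \mathcal{X} \to \mathcal{Y}$ of $\varphi$ at any $x \in U$ is the composition

\[ \frac{\mathrm{d}\varphi}{\mathrm{d}x}(x) = - \Bigl( \frac{\partial f}{\partial y}(x, \varphi(x)) \Bigr)^{-1} \Bigl( \frac{\partial f}{\partial x}(x, \varphi(x)) \Bigr) . \tag{10.2} \]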

We now apply the IFT to derive the adjoint method for sensitivity analysis.
Let $\mathcal{U}$ and $\Theta$ be (open subsets of) Banach spaces. Suppose that uncertain
parameters $\theta \in \Theta$ and a derived quantity $u \in \mathcal{U}$ are related by an implicit
function of the form $F(u, \theta) = 0$; to take a very simple example, suppose that
$u \colon [-1, 1] \to \mathbb{R}$ solves the boundary value problem

\[ - \frac{\mathrm{d}}{\mathrm{d}x} \Bigl( e^{\theta} \frac{\mathrm{d}}{\mathrm{d}x} u(x) \Bigr) = (x - 1)(x + 1), \quad -1 < x < 1, \]
\[ u(x) = 0, \quad x \in \{ \pm 1 \} . \]

Suppose also that we are interested in understanding the effect of changing
$\theta$ upon the value of a quantity of interest $q \colon \mathcal{U} \times \Theta \to \mathbb{R}$. To be more precise,
the aim is to understand the derivative of $q(u, \theta)$ with respect to $\theta$, with $u$
depending on $\theta$ via $F(u, \theta) = 0$, at some nominal point $(\bar{u}, \bar{\theta})$.

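Observe that, by the chain rule,

\[ \frac{\mathrm{d}q}{\mathrm{d}\theta}(\bar{u}, \bar{\theta}) = \frac{\partial q}{\partial u}(\bar{u}, \bar{\theta}) \frac{\partial u}{\partial \theta}(\bar{\theta}) + \frac{\partial q}{\partial \theta}(\bar{u}, \bar{\theta}) . \tag{10.3} \]

Note that (10.3) only makes sense if $u$ can be locally expressed as a
differentiable function of $\theta$ near $(\bar{u}, \bar{\theta})$; by the IFT, a sufficient condition for this is
that $F$ is continuously Fréchet differentiable near $(\bar{u}, \bar{\theta})$ with $\frac{\partial F}{\partial u}(\bar{u}, \bar{\theta})$
invertible. Using this insight, the partial derivative of the solution $u$ with respect
to the parameters $\theta$ can be eliminated from (10.3) to yield an expression that
uses only the partial derivatives of the explicit functions $F$ and $q$.

To perform this elimination, observe that the total derivative of $F$ vanishes
everywhere on the set of $(u, \theta) \in \mathcal{U} \times \Theta$ such that $F(u, \theta) = 0$ (or, indeed, on
any level set of $F$), and so the chain rule gives

\[ \frac{\mathrm{d}F}{\mathrm{d}\theta} = \frac{\partial F}{\partial u} \frac{\partial u}{\partial \theta} + \frac{\partial F}{\partial \theta} \equiv 0 . \]

Therefore, since $\frac{\partial F}{\partial u}(\bar{u}, \bar{\theta})$ is invertible,

\[ \frac{\partial u}{\partial \theta}(\bar{\theta}) = - \Bigl( \frac{\partial F}{\partial u}(\bar{u}, \bar{\theta}) \Bigr)^{-1} \frac{\partial F}{\partial \theta}(\bar{u}, \bar{\theta}) , \tag{10.4} \]

as in (10.2) in the conclusion of the IFT. Thus, (10.3) becomes

\[ \frac{\mathrm{d}q}{\mathrm{d}\theta}(\bar{u}, \bar{\theta}) = - \frac{\partial q}{\partial u}(\bar{u}, \bar{\theta}) \Bigl( \frac{\partial F}{\partial u}(\bar{u}, \bar{\theta}) \Bigr)^{-1} \frac{\partial F}{\partial \theta}(\bar{u}, \bar{\theta}) + \frac{\partial q}{\partial \theta}(\bar{u}, \bar{\theta}) , \tag{10.5} \]

which, as desired, avoids explicit reference to $\frac{\partial u}{\partial \theta}$.
Equation (10.5) can be re-written as

\[ \frac{\mathrm{d}q}{\mathrm{d}\theta}(\bar{u}, \bar{\theta}) = \lambda \frac{\partial F}{\partial \theta}(\bar{u}, \bar{\theta}) + \frac{\partial q}{\partial \theta}(\bar{u}, \bar{\theta}) , \]

where the linear functional $\lambda \in \mathcal{U}'$ is the solution to

\[ \lambda \frac{\partial F}{\partial u}(\bar{u}, \bar{\theta}) = - \frac{\partial q}{\partial u}(\bar{u}, \bar{\theta}) , \tag{10.6} \]

or, equivalently, taking the adjoint (conjugate transpose) of (10.6),

\[ \Bigl( \frac{\partial F}{\partial u}(\bar{u}, \bar{\theta}) \Bigr)^{*} \lambda^{*} = - \Bigl( \frac{\partial q}{\partial u}(\bar{u}, \bar{\theta}) \Bigr)^{*} , \tag{10.7} \]

which is known as the adjoint equation. This is a powerful tool for investigating
the dependence of $q$ upon $\theta$, because we can now compute $\frac{\mathrm{d}q}{\mathrm{d}\theta}$ without
ever having to work out the relationship between $\theta$ and $u$ or its derivative
$\frac{\partial u}{\partial \theta}$ explicitly: we only need partial derivatives of $F$ and $q$ with respect to
$\theta$ and $u$, which are usually much easier to calculate. We then need only solve
(10.6)/(10.7) for $\lambda$, and then substitute that result into (10.5).

Naturally, the system (10.6)/(10.7) is almost never solved by explicitly
computing the inverse matrix; instead, the usual direct (e.g. Gaussian elimination
with partial pivoting, the QR method) or iterative methods (e.g. the
Jacobi or Gauss–Seidel iterations) are used. See Exercise 10.4 for an example
of the adjoint method for an ODE.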
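
To make the adjoint method concrete, the following sketch applies it to a small finite-dimensional problem that is not taken from the text: the state $u \in \mathbb{R}^n$ solves $F(u, \theta) := A(\theta) u - b = 0$ with $A(\theta) = K_0 + \sum_k \theta_k K_k$, and the quantity of interest is $q(u, \theta) = c \cdot u + \tfrac{1}{2} \|\theta\|^2$. All matrices, vectors and function names are hypothetical, and NumPy is assumed. The adjoint gradient is compared against a central finite difference.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 4, 3                                # state and parameter dimensions
K0 = 4.0 * np.eye(n)
K = 0.3 * rng.standard_normal((p, n, n))   # K[k] multiplies theta[k]
b = rng.standard_normal(n)
c = rng.standard_normal(n)

def A(theta):
    return K0 + np.tensordot(theta, K, axes=1)

def solve_state(theta):
    """Solve F(u, theta) = A(theta) u - b = 0 for u."""
    return np.linalg.solve(A(theta), b)

def q(u, theta):
    return c @ u + 0.5 * theta @ theta

def gradient_adjoint(theta):
    u = solve_state(theta)
    # dF/dtheta is the n x p matrix whose k-th column is K[k] u; dF/du = A(theta).
    dF_dtheta = np.stack([K[k] @ u for k in range(p)], axis=1)
    # Adjoint equation (dF/du)^T lambda^T = -(dq/du)^T, with dq/du = c^T.
    lam = np.linalg.solve(A(theta).T, -c)
    # dq/dtheta = lambda dF/dtheta + (partial q / partial theta), the latter being theta^T.
    return lam @ dF_dtheta + theta

def gradient_fd(theta, h=1e-6):
    """Central finite differences; needs 2p state solves instead of one adjoint solve."""
    g = np.zeros(p)
    for k in range(p):
        e = np.zeros(p)
        e[k] = h
        g[k] = (q(solve_state(theta + e), theta + e)
                - q(solve_state(theta - e), theta - e)) / (2.0 * h)
    return g

theta_bar = np.array([0.1, -0.2, 0.3])
grad_adj = gradient_adjoint(theta_bar)
grad_fd = gradient_fd(theta_bar)
```

The adjoint route costs one state solve and one adjoint solve regardless of the number of parameters $p$, whereas the finite difference check costs $2p$ state solves; this is the usual reason for preferring the adjoint method when $p$ is large.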

Remark 10.8. Besides their local nature, the use of partial derivatives as
sensitivity indices suffers from another problem well known to students of
multivariate differential calculus: a function can have well-defined partial
derivatives that all vanish, yet not be continuous, let alone locally constant.
The standard example of such a function is $f \colon \mathbb{R}^2 \to \mathbb{R}$ defined by

\[ f(x, y) := \begin{cases} \dfrac{xy}{x^2 + y^2} , & \text{if } (x, y) \neq (0, 0), \\ 0, & \text{if } (x, y) = (0, 0). \end{cases} \]

This function $f$ is discontinuous at $(0, 0)$, since approaching $(0, 0)$ along the
line $x = 0$ gives

\[ \lim_{\substack{x = 0 \\ y \to 0}} f(x, y) = \lim_{y \to 0} f(0, y) = \lim_{y \to 0} 0 = 0 , \]

but approaching $(0, 0)$ along the line $x = y$ gives

\[ \lim_{y = x \to 0} f(x, y) = \lim_{x \to 0} \frac{x^2}{2 x^2} = \frac{1}{2} \neq 0 . \]

However, $f$ has well-defined partial derivatives with respect to $x$ and $y$ at
every point in $\mathbb{R}^2$, and in particular at the origin:

\[ \frac{\partial f}{\partial x}(x, y) = \begin{cases} \dfrac{y^3 - x^2 y}{(x^2 + y^2)^2} , & \text{if } (x, y) \neq (0, 0), \\ 0, & \text{if } (x, y) = (0, 0), \end{cases} \]
\[ \frac{\partial f}{\partial y}(x, y) = \begin{cases} \dfrac{x^3 - x y^2}{(x^2 + y^2)^2} , & \text{if } (x, y) \neq (0, 0), \\ 0, & \text{if } (x, y) = (0, 0). \end{cases} \]

Such pathologies do not arise if the partial derivatives are themselves
continuous functions. Therefore, before placing much trust in the partial derivatives
of $f$ as local sensitivity indices, one should check that $f$ is $C^1$.
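
The behaviour described in Remark 10.8 is easy to observe numerically. The following sketch, in plain Python with illustrative names, evaluates the difference quotients that define the partial derivatives at the origin, together with the values of the function along the diagonal $x = y$.

```python
def f(x, y):
    """The function of Remark 10.8: xy / (x^2 + y^2) away from the origin, 0 at it."""
    return 0.0 if (x, y) == (0.0, 0.0) else x * y / (x**2 + y**2)

h_values = [10.0 ** (-k) for k in range(1, 8)]

# Difference quotients defining the partial derivatives at the origin.
dfdx_quotients = [(f(h, 0.0) - f(0.0, 0.0)) / h for h in h_values]
dfdy_quotients = [(f(0.0, h) - f(0.0, 0.0)) / h for h in h_values]

# Values of f along the diagonal x = y, approaching the origin.
diagonal_values = [f(h, h) for h in h_values]
```

Every difference quotient vanishes, consistent with both partial derivatives being zero at the origin, whereas the diagonal values stay at $\tfrac{1}{2}$ however small $h$ becomes.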


10.3 McDiarmid Diameters 

Unlike the partial derivatives of the previous section, which are local measures
of parameter sensitivity, this section considers global '$L^\infty$-type' sensitivity
indices that measure the sensitivity of a function of $n$ variables or parameters
to variations in those variables/parameters individually.

Definition 10.9. The $i$th McDiarmid subdiameter of $f \colon \mathcal{X} := \prod_{i=1}^{n} \mathcal{X}_i \to \mathbb{K}$
is defined by

\[ \mathcal{D}_i[f] := \sup \bigl\{ |f(x) - f(y)| \bigm| x, y \in \mathcal{X} \text{ such that } x_j = y_j \text{ for } j \neq i \bigr\} ; \]

equivalently, $\mathcal{D}_i[f]$ is

\[ \sup \bigl\{ |f(x) - f(x_1, \dots, x_{i-1}, x_i', x_{i+1}, \dots, x_n)| \bigm| x = (x_1, \dots, x_n) \in \mathcal{X} \text{ and } x_i' \in \mathcal{X}_i \bigr\} . \]

The McDiarmid diameter of $f$ is

\[ \mathcal{D}[f] := \sqrt{ \sum_{i=1}^{n} \mathcal{D}_i[f]^2 } . \]

Remark 10.10. Note that although the two definitions of $\mathcal{D}_i[f]$ given above
are obviously mathematically equivalent, they are very different from a
computational point of view: the first formulation is 'obviously' a constrained
optimization problem in $2n$ variables with $n - 1$ constraints (i.e. 'difficult'),
whereas the second formulation is 'obviously' an unconstrained optimization
problem in $n + 1$ variables (i.e. 'easy').
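
The second formulation lends itself to a direct, if naive, numerical treatment. The sketch below estimates the subdiameters by exhaustive search over a tensor-product grid; the grid, the example function and the function names are illustrative assumptions. In general a grid search only yields a lower bound on each supremum, although for this example the suprema are attained at grid points.

```python
import itertools
import math

def subdiameter(f, grids, i):
    """Grid estimate of D_i[f]: search over x in the grid and a replacement for x_i."""
    best = 0.0
    for x in itertools.product(*grids):
        fx = f(x)
        for xi_new in grids[i]:
            y = x[:i] + (xi_new,) + x[i + 1:]   # change coordinate i only
            best = max(best, abs(fx - f(y)))
    return best

def diameter(f, grids):
    """Grid estimate of D[f], the root-sum-of-squares of the subdiameters."""
    return math.sqrt(sum(subdiameter(f, grids, i) ** 2
                         for i in range(len(grids))))

def f(x):
    return x[0] + 2.0 * x[1] * x[2]

grid = [j / 10 for j in range(11)]   # 0, 0.1, ..., 1
grids = [grid, grid, grid]           # the cube [0, 1]^3
D = [subdiameter(f, grids, i) for i in range(3)]
D_total = diameter(f, grids)
```

For this $f$ on $[0, 1]^3$ the exact values are $\mathcal{D}_1[f] = 1$, $\mathcal{D}_2[f] = \mathcal{D}_3[f] = 2$ and $\mathcal{D}[f] = 3$. The cost of exhaustive search grows exponentially with $n$, so a practical implementation would replace it with a numerical optimizer.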

Lemma 10.11. For each $j = 1, \dots, n$, $\mathcal{D}_j[\,\cdot\,]$ is a seminorm on the space of
bounded functions $f \colon \mathcal{X} \to \mathbb{K}$, as is $\mathcal{D}[\,\cdot\,]$.

Proof. Exercise 10.5. □


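The McDiarmid subdiameters and diameter are useful not only as
sensitivity indices, but also for providing a rigorous upper bound on deviations of
a function of independent random variables from its mean value:

Theorem 10.12 (McDiarmid's bounded differences inequality). Let $X =
(X_1, \dots, X_n)$ be any random variable with independent components taking
values in $\mathcal{X} = \prod_{i=1}^{n} \mathcal{X}_i$, and let $f \colon \mathcal{X} \to \mathbb{R}$ be absolutely integrable with
respect to the law of $X$ and have finite McDiarmid diameter $\mathcal{D}[f]$. Then, for
any $t \geq 0$,

\[ \mathbb{P} \bigl[ f(X) \geq \mathbb{E}[f(X)] + t \bigr] \leq \exp \Bigl( - \frac{2 t^2}{\mathcal{D}[f]^2} \Bigr) , \tag{10.8} \]
\[ \mathbb{P} \bigl[ f(X) \leq \mathbb{E}[f(X)] - t \bigr] \leq \exp \Bigl( - \frac{2 t^2}{\mathcal{D}[f]^2} \Bigr) , \tag{10.9} \]
\[ \mathbb{P} \bigl[ |f(X) - \mathbb{E}[f(X)]| \geq t \bigr] \leq 2 \exp \Bigl( - \frac{2 t^2}{\mathcal{D}[f]^2} \Bigr) . \tag{10.10} \]

Corollary 10.13 (Hoeffding's inequality). Let $X = (X_1, \dots, X_n)$ be a
random variable with independent components, taking values in the cuboid
$\prod_{i=1}^{n} [a_i, b_i]$. Let $S_n := \frac{1}{n} \sum_{i=1}^{n} X_i$. Then, for any $t \geq 0$,

\[ \mathbb{P} \bigl[ S_n - \mathbb{E}[S_n] \geq t \bigr] \leq \exp \Bigl( \frac{-2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \Bigr) , \]

and similarly for deviations below, and either side, of the mean.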
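
Hoeffding's inequality can be compared with an empirical tail probability. The sketch below assumes NumPy and takes the $X_i$ to be independent and uniformly distributed on $[0, 1]$, so that $a_i = 0$, $b_i = 1$ and $\mathbb{E}[S_n] = \tfrac{1}{2}$; the sample sizes and the threshold $t$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, t, trials = 50, 0.1, 100_000

# Each row is one realization of (X_1, ..., X_n) with X_i ~ Uniform[0, 1].
samples = rng.random((trials, n))
S_n = samples.mean(axis=1)

# Fraction of realizations in which S_n exceeds its mean 1/2 by at least t.
empirical_tail = np.mean(S_n - 0.5 >= t)

# Hoeffding's bound exp(-2 n^2 t^2 / sum_i (b_i - a_i)^2); the sum equals n here.
hoeffding_bound = np.exp(-2.0 * n**2 * t**2 / n)
```

The bound is valid but typically far from tight: it depends only on the ranges $b_i - a_i$ and not on the variances of the $X_i$, so the empirical tail probability is usually much smaller than `hoeffding_bound`.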

McDiarmid's and Hoeffding's inequalities are just two examples of a
broad family of inequalities known as concentration of measure inequalities.
Roughly put, the concentration of measure phenomenon, which was first
noticed by Lévy (1951), is the fact that a function of a high-dimensional
random variable with many independent (or weakly correlated) components
has its values overwhelmingly concentrated about the mean (or median). An
inequality such as McDiarmid's provides a rigorous certification criterion: to
be sure that $f(X)$ will deviate above its mean by more than $t$ with probability
no greater than $\varepsilon \in [0, 1]$, it suffices to show that

\[ \exp \Bigl( - \frac{2 t^2}{\mathcal{D}[f]^2} \Bigr) \leq \varepsilon , \]

i.e.

\[ \mathcal{D}[f] \leq t \sqrt{ \frac{2}{\log \varepsilon^{-1}} } . \]

Experimental effort then revolves around determining $\mathbb{E}[f(X)]$ and $\mathcal{D}[f]$;
given those ingredients, the certification criterion is mathematically rigorous.
That said, it is unlikely to be the optimal rigorous certification criterion,
because McDiarmid's inequality is not guaranteed to be sharp. The calculation
of optimal probability inequalities is considered in Chapter 14.
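
The certification criterion amounts to a one-line calculation. The following sketch, with hypothetical function names and illustrative values of $t$ and $\varepsilon$, evaluates the McDiarmid tail bound and the largest diameter for which that bound certifies a failure probability of at most $\varepsilon$.

```python
import math

def mcdiarmid_tail_bound(t, diameter):
    """Upper bound exp(-2 t^2 / D[f]^2) on P[f(X) >= E[f(X)] + t]."""
    return math.exp(-2.0 * t**2 / diameter**2)

def max_certifiable_diameter(t, eps):
    """Largest D[f] for which the bound above is at most eps (0 < eps < 1)."""
    return t * math.sqrt(2.0 / math.log(1.0 / eps))

t, eps = 1.0, 0.01
D_max = max_certifiable_diameter(t, eps)
bound_at_D_max = mcdiarmid_tail_bound(t, D_max)
```

Any function whose McDiarmid diameter is at most `D_max` is certified by this criterion; a larger diameter does not imply failure, only that this particular bound is inconclusive.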

To prove McDiarmid's inequality first requires a lemma bounding the
moment-generating function of a random variable:

Lemma 10.14 (Hoeffding's lemma). Let $X$ be a random variable with mean
zero taking values in $[a, b]$. Then, for $t \geq 0$,

\[ \mathbb{E} \bigl[ e^{tX} \bigr] \leq \exp \Bigl( \frac{t^2 (b - a)^2}{8} \Bigr) . \]

Proof. By the convexity of the exponential function, for each $x \in [a, b]$,

\[ e^{tx} \leq \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb} . \]

Therefore, applying the expectation operator,

\[ \mathbb{E} \bigl[ e^{tX} \bigr] \leq \frac{b}{b - a} e^{ta} + \frac{-a}{b - a} e^{tb} =: e^{\phi(t)} . \]

Observe that $\phi(0) = 0$, $\phi'(0) = 0$, and $\phi''(t) \leq \frac{1}{4} (b - a)^2$. Hence, since $\exp$ is
an increasing and convex function,

\[ \mathbb{E} \bigl[ e^{tX} \bigr] \leq \exp \Bigl( 0 + 0 t + \frac{(b - a)^2}{4} \frac{t^2}{2} \Bigr) = \exp \Bigl( \frac{t^2 (b - a)^2}{8} \Bigr) . \]

□


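We can now give the proof of McDiarmid's inequality, which uses
Hoeffding's lemma and the properties of conditional expectation outlined in
Example 3.22.

Proof of McDiarmid's inequality (Theorem 10.12). Let $\mathcal{F}_i$ be the
$\sigma$-algebra generated by $X_1, \dots, X_i$, and define random variables $Z_0, \dots, Z_n$
by $Z_i := \mathbb{E}[f(X) | \mathcal{F}_i]$. Note that $Z_0 = \mathbb{E}[f(X)]$ and $Z_n = f(X)$. Now consider
the conditional increment $(Z_i - Z_{i-1}) | \mathcal{F}_{i-1}$. First observe that

\[ \mathbb{E}[ Z_i - Z_{i-1} | \mathcal{F}_{i-1} ] = 0 , \]

so that the sequence $(Z_i)_{i \geq 0}$ is a martingale. Secondly, observe that

\[ L_i \leq ( Z_i - Z_{i-1} | \mathcal{F}_{i-1} ) \leq U_i , \]

where

\[ L_i := \inf_{\ell} \mathbb{E}[ f(X) | \mathcal{F}_{i-1}, X_i = \ell ] - \mathbb{E}[ f(X) | \mathcal{F}_{i-1} ] , \]
\[ U_i := \sup_{u} \mathbb{E}[ f(X) | \mathcal{F}_{i-1}, X_i = u ] - \mathbb{E}[ f(X) | \mathcal{F}_{i-1} ] . \]

Since $U_i - L_i \leq \mathcal{D}_i[f]$, Hoeffding's lemma implies that

\[ \mathbb{E} \bigl[ e^{s (Z_i - Z_{i-1})} \bigm| \mathcal{F}_{i-1} \bigr] \leq e^{s^2 \mathcal{D}_i[f]^2 / 8} . \tag{10.11} \]

Hence, for any $s \geq 0$,

\begin{align*}
\mathbb{P}[ f(X) - \mathbb{E}[f(X)] \geq t ]
&= \mathbb{P} \bigl[ e^{s (f(X) - \mathbb{E}[f(X)])} \geq e^{st} \bigr] \\
&\leq e^{-st} \, \mathbb{E} \bigl[ e^{s (f(X) - \mathbb{E}[f(X)])} \bigr] && \text{by Markov's ineq.} \\
&= e^{-st} \, \mathbb{E} \bigl[ e^{s \sum_{i=1}^{n} (Z_i - Z_{i-1})} \bigr] && \text{as a telescoping sum} \\
&= e^{-st} \, \mathbb{E} \Bigl[ \mathbb{E} \bigl[ e^{s \sum_{i=1}^{n} (Z_i - Z_{i-1})} \bigm| \mathcal{F}_{n-1} \bigr] \Bigr] && \text{by the tower rule} \\
&= e^{-st} \, \mathbb{E} \Bigl[ e^{s \sum_{i=1}^{n-1} (Z_i - Z_{i-1})} \, \mathbb{E} \bigl[ e^{s (Z_n - Z_{n-1})} \bigm| \mathcal{F}_{n-1} \bigr] \Bigr] \\
&\leq e^{-st} e^{s^2 \mathcal{D}_n[f]^2 / 8} \, \mathbb{E} \bigl[ e^{s \sum_{i=1}^{n-1} (Z_i - Z_{i-1})} \bigr]
\end{align*}

since $Z_0, \dots, Z_{n-1}$ are $\mathcal{F}_{n-1}$-measurable, and by (10.11). Repeating this
argument a further $n - 1$ times shows that

\[ \mathbb{P}[ f(X) - \mathbb{E}[f(X)] \geq t ] \leq \exp \Bigl( -st + \frac{s^2}{8} \mathcal{D}[f]^2 \Bigr) . \tag{10.12} \]

The right-hand side of (10.12) is minimized by $s = 4t / \mathcal{D}[f]^2$, which yields
McDiarmid's inequality (10.8). The inequalities (10.9) and (10.10) follow
easily from (10.8). □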

10.4 ANOVA/HDMR Decompositions


The topic of this section is a variance-based decomposition of a function
of $n$ variables that goes by various names such as the analysis of variance
(ANOVA), the functional ANOVA, the high-dimensional model representation
(HDMR), or the integral representation. As before, let $(\mathcal{X}_i, \mathcal{F}_i, \mu_i)$ be a
probability space for $i = 1, \dots, n$, and let $(\mathcal{X}, \mathcal{F}, \mu)$ be the product space.
Write $\mathcal{N} := \{1, \dots, n\}$, and consider a ($\mathcal{F}$-measurable) function of interest
$f \colon \mathcal{X} \to \mathbb{R}$. Bearing in mind that in practical applications $n$ may be large
($10^3$ or more), it is of interest to efficiently identify

• which of the $x_i$ contribute in the most dominant ways to the variations
in $f(x_1, \dots, x_n)$,

• how the effects of multiple $x_i$ are cooperative or competitive with one
another,

• and hence construct a surrogate model for $f$ that uses a lower-dimensional
set of input variables, by using only those that give rise to dominant
effects.

The idea is to write $f(x_1, \dots, x_n)$ as a sum of the form

\[ f(x_1, \dots, x_n) = f_\emptyset + \sum_{i=1}^{n} f_{\{i\}}(x_i) + \sum_{1 \leq i < j \leq n} f_{\{i,j\}}(x_i, x_j) + \cdots \tag{10.13} \]
\[ = \sum_{I \subseteq \mathcal{N}} f_I(x_I) . \]

Experience suggests that 'typical real-world systems' $f$ exhibit only low-order
cooperativity in the effects of the input variables $x_1, \dots, x_n$. That is, the terms
$f_I$ with $|I| \gg 1$ are typically small, and a good approximation of $f$ is given
by, say, a second-order expansion,

\[ f(x_1, \dots, x_n) \approx f_\emptyset + \sum_{i=1}^{n} f_{\{i\}}(x_i) + \sum_{1 \leq i < j \leq n} f_{\{i,j\}}(x_i, x_j) . \]

Note, however, that low-order cooperativity does not necessarily imply that
there is a small set of significant variables (it is possible that $f_{\{i\}}$ is large
for most $i \in \{1, \dots, n\}$), nor does it say anything about the linearity or
non-linearity of the input-output relationship. Furthermore, there are many
HDMR-type expansions of the form given above; orthogonality criteria can
be used to select a particular HDMR representation.

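Recall that, for $I \subseteq \mathcal{N}$, the conditional expectation operator

\[ f \mapsto \mathbb{E}_\mu [ f(x_1, \dots, x_n) | x_i, i \in I ] = \int_{\prod_{i \in I} \mathcal{X}_i} f(x_1, \dots, x_n) \, \mathrm{d} \bigotimes_{i \in I} \mu_i(x_i) \]

is an orthogonal projection operator from $L^2(\mathcal{X}, \mu; \mathbb{R})$ to the set of
square-integrable measurable functions that are independent of $x_i$ for $i \in I$, i.e. that
depend only on $x_i$ for $i \in \mathcal{N} \setminus I$. Let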
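\[ P_\emptyset f := \mathbb{E}_\mu[f] \]

and, for non-empty $I \subseteq \mathcal{N}$,

\[ P_I f := \mathbb{E}_\mu [ f(x_1, \dots, x_n) | x_i, i \notin I ] - \sum_{J \subsetneq I} P_J f . \]

The functions $f_I := P_I f$ provide a decomposition of $f$ of the desired form
(10.13). By construction, we have the following:

Theorem 10.15 (ANOVA). For each $I \subseteq \mathcal{N}$, the linear operator $P_I$ is an
orthogonal projection of $L^2(\mathcal{X}, \mu; \mathbb{R})$ onto

\[ F_I := \Bigl\{ f \Bigm| f \text{ is independent of } x_j \text{ for } j \notin I \text{ and, for } i \in I, \int_{\mathcal{X}_i} f(x) \, \mathrm{d}\mu_i(x_i) = 0 \Bigr\} \subseteq L^2(\mathcal{X}, \mu; \mathbb{R}) . \]

Furthermore, the linear operators $P_I$ are idempotent, commutative and
mutually orthogonal, i.e.

\[ P_I P_J f = P_J P_I f = \begin{cases} P_I f , & \text{if } I = J, \\ 0, & \text{if } I \neq J, \end{cases} \]

and form a resolution of the identity:

\[ \sum_{I \subseteq \mathcal{N}} P_I f = f . \]

Thus, $L^2(\mathcal{X}, \mu; \mathbb{R}) = \bigoplus_{I \subseteq \mathcal{N}} F_I$ is an orthogonal decomposition of $L^2(\mathcal{X}, \mu; \mathbb{R})$,
so Parseval's formula implies the following decomposition of the variance
$\sigma^2 := \| f - P_\emptyset f \|_{L^2(\mu)}^2$ of $f$:

\[ \sigma^2 = \sum_{I \subseteq \mathcal{N}} \sigma_I^2 , \tag{10.14} \]

where

\[ \sigma_\emptyset^2 := 0 , \]
\[ \sigma_I^2 := \int_{\mathcal{X}} (P_I f)(x)^2 \, \mathrm{d}\mu(x) . \]

Two commonly used ANOVA/HDMR decompositions are random sampling
HDMR, in which $\mu_i$ is uniform measure on $[0, 1]$, and Cut-HDMR, in
which an expansion is performed with respect to a reference point $\bar{x} \in \mathcal{X}$,
i.e. $\mu$ is the unit Dirac measure $\delta_{\bar{x}}$: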
\[ f_\emptyset(x) = f(\bar{x}) , \]
\[ f_{\{i\}}(x) = f(\bar{x}_1, \dots, \bar{x}_{i-1}, x_i, \bar{x}_{i+1}, \dots, \bar{x}_n) - f_\emptyset(x) , \]
\[ f_{\{i,j\}}(x) = f(\bar{x}_1, \dots, \bar{x}_{i-1}, x_i, \bar{x}_{i+1}, \dots, \bar{x}_{j-1}, x_j, \bar{x}_{j+1}, \dots, \bar{x}_n) - f_{\{i\}}(x) - f_{\{j\}}(x) - f_\emptyset(x) , \]
\[ \vdots \]

Note that a component function $f_I$ of a Cut-HDMR expansion vanishes at
any $x \in \mathcal{X}$ that has a component in common with $\bar{x}$, i.e.

\[ f_I(x) = 0 \text{ whenever } x_i = \bar{x}_i \text{ for some } i \in I . \]

Hence,

\[ f_I(x) f_J(x) = 0 \text{ whenever } x_k = \bar{x}_k \text{ for some } k \in I \cup J . \]

Indeed, this orthogonality relation defines the Cut-HDMR expansion.

Sobol' Sensitivity Indices. The decomposition of the variance (10.14)
given by an HDMR/ANOVA decomposition naturally gives rise to a set of
sensitivity indices for ranking the most important input variables and their
cooperative effects. An obvious (and naive) assessment of the relative
importance of the variables $x_I$ is the variance component $\sigma_I^2$, or the normalized
contribution $\sigma_I^2 / \sigma^2$. However, this measure neglects the contributions of those
$x_J$ with $J \subseteq I$, or those $x_J$ such that $J$ has some indices in common with $I$.
With this in mind, Sobol' (1990) defined sensitivity indices as follows:

Definition 10.16. Given an HDMR decomposition of a function $f$ of $n$
variables, the lower and upper Sobol' sensitivity indices of $I \subseteq \mathcal{N}$ are,
respectively,

\[ \underline{\tau}_I^2 := \sum_{J \subseteq I} \sigma_J^2 , \quad \text{and} \quad \overline{\tau}_I^2 := \sum_{J \cap I \neq \emptyset} \sigma_J^2 . \]

The normalized lower and upper Sobol' sensitivity indices of $I \subseteq \mathcal{N}$ are,
respectively,

\[ \underline{s}_I^2 := \underline{\tau}_I^2 / \sigma^2 , \quad \text{and} \quad \overline{s}_I^2 := \overline{\tau}_I^2 / \sigma^2 . \]

Since $\sum_{I \subseteq \mathcal{N}} \sigma_I^2 = \sigma^2 = \| f - f_\emptyset \|_{L^2}^2$, it follows immediately that, for each
$I \subseteq \mathcal{N}$,

\[ 0 \leq \underline{s}_I^2 \leq \overline{s}_I^2 \leq 1 . \]

Note, however, that while Theorem 10.15 guarantees that $\sigma^2 = \sum_{I \subseteq \mathcal{N}} \sigma_I^2$, in
general Sobol' indices satisfy no such additivity relation:

\[ 1 \neq \sum_{I \subseteq \mathcal{N}} \underline{s}_I^2 \leq \sum_{I \subseteq \mathcal{N}} \overline{s}_I^2 \neq 1 . \]
The decomposition of variance (10.14), and sensitivity indices such as the
Sobol' indices, can also be used to form approximations to $f$ with
lower-dimensional input domain: see Exercise 10.8.
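
For a function of two variables, the ANOVA components and the resulting Sobol' indices can be computed exactly on a grid. The sketch below assumes NumPy and takes each $\mu_i$ to be the uniform probability measure on a finite set of grid points in $[0, 1]$, which is a simplifying assumption made here so that all integrals become finite averages; the example function and variable names are likewise illustrative.

```python
import numpy as np

# Tabulate f(x1, x2) = x1 + x2^2 + x1 * x2 on a 21 x 21 grid; rows index x1,
# columns index x2. Averages over the grid play the role of integrals.
x1 = np.linspace(0.0, 1.0, 21)
x2 = np.linspace(0.0, 1.0, 21)
F = x1[:, None] + x2[None, :] ** 2 + x1[:, None] * x2[None, :]

f_empty = F.mean()                               # constant term
f_1 = F.mean(axis=1, keepdims=True) - f_empty    # depends on x1 only
f_2 = F.mean(axis=0, keepdims=True) - f_empty    # depends on x2 only
f_12 = F - f_empty - f_1 - f_2                   # interaction term

# Variance components and total variance.
var_total = np.mean((F - f_empty) ** 2)
var_1 = np.mean(f_1 ** 2)
var_2 = np.mean(f_2 ** 2)
var_12 = np.mean(f_12 ** 2)

# Normalized lower and upper Sobol' indices of I = {1}.
s_lower_1 = var_1 / var_total
s_upper_1 = (var_1 + var_12) / var_total
```

The three variance components sum to `var_total`, as in (10.14), and the gap between `s_upper_1` and `s_lower_1` is exactly the normalized interaction variance. For larger $n$ the number of subsets $I$ grows as $2^n$, so in practice only low-order terms are computed, typically by Monte Carlo sampling rather than on a full grid.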


10.5 Active Subspaces 


The global sensitivity measures discussed above, such as Sobol' indices and
McDiarmid diameters, can be used to identify a collection of important input
parameters for a given response function. By way of contrast, the active
subspace method seeks to identify a collection of important directions that
are not necessarily aligned with the coordinate axes.

In this case, we take as the model input space $\mathcal{X} = [-1, 1]^n \subset \mathbb{R}^n$, and
$f \colon \mathcal{X} \to \mathbb{R}$ is a function of interest. Suppose that, for each $x \in \mathcal{X}$, both
$f(x) \in \mathbb{R}$ and $\nabla f(x) \in \mathbb{R}^n$ can be easily evaluated; note that evaluation of
$\nabla f(x)$ might be accomplished by many means, e.g. finite differences,
automatic differentiation, or use of the adjoint method. Also, let $\mathcal{X}$ be equipped
with a probability measure $\mu$. Informally, an active subspace for $f$ will be a
linear subspace of $\mathbb{R}^n$ for which $f$ varies a lot more on average (with respect
to $\mu$) along directions in the active subspace than along those in the
complementary inactive subspace.

Suppose that all pairwise products of the partial derivatives of $f$ are
integrable with respect to $\mu$. Define $C = C(\nabla f, \mu) \in \mathbb{R}^{n \times n}$ by

\[ C := \mathbb{E}_{X \sim \mu} \bigl[ (\nabla f(X)) (\nabla f(X))^T \bigr] . \tag{10.15} \]

Note that $C$ is symmetric and positive semi-definite, so it diagonalizes as

\[ C = W \Lambda W^T , \]

where $W \in \mathbb{R}^{n \times n}$ is an orthogonal matrix whose columns $w_1, \dots, w_n$ are
the eigenvectors of $C$, and $\Lambda \in \mathbb{R}^{n \times n}$ is a diagonal matrix with diagonal
entries $\lambda_1 \geq \dots \geq \lambda_n \geq 0$, which are the corresponding eigenvalues of $C$.
A quick calculation reveals that the eigenvalue $\lambda_i$ is nothing other than the
mean-squared value of the directional derivative in the direction $w_i$:

\[ \lambda_i = w_i^T C w_i = w_i^T \mathbb{E}_\mu \bigl[ (\nabla f)(\nabla f)^T \bigr] w_i = \mathbb{E}_\mu \bigl[ (\nabla f \cdot w_i)^2 \bigr] . \tag{10.16} \]

In general, the eigenvalues of $C$ may be any non-negative reals. If, however,
some are clearly 'large' and some are 'small', then this partitioning of the
eigenvalues and observation (10.16) can be used to define a new coordinate
system on $\mathbb{R}^n$ such that in some directions $f$ varies 'a lot' and in others it
varies 'only a little'. More precisely, write $\Lambda$ and $W$ in block form as
\[ \Lambda = \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \quad \text{and} \quad W = \begin{bmatrix} W_1 & W_2 \end{bmatrix} , \tag{10.17} \]

where $\Lambda_1 \in \mathbb{R}^{k \times k}$ and $W_1 \in \mathbb{R}^{n \times k}$ with $k \leq n$; of course, the idea is that
$k \ll n$, and that $\lambda_k \gg \lambda_{k+1}$. This partitioning of the eigenvalues and
eigenvectors of $C$ defines new variables $y \in \mathbb{R}^k$ and $z \in \mathbb{R}^{n-k}$ by

\[ y := W_1^T x \quad \text{and} \quad z := W_2^T x , \tag{10.18} \]

so that $x = W_1 y + W_2 z$. Note that the $(y, z)$ coordinate system is simply
a rotation of the original $x$ coordinate system. The $k$-dimensional subspace
spanned by $w_1, \dots, w_k$ is called the active subspace for $f$ over $\mathcal{X}$ with respect
to $\mu$. The heuristic requirement that $f$ should vary mostly in the directions
of the active subspace is quantified by the eigenvalues of $C$:

Proposition 10.17. The mean-squared gradients of $f$ with respect to the
active coordinates $y \in \mathbb{R}^k$ and inactive coordinates $z \in \mathbb{R}^{n-k}$ satisfy

\[ \mathbb{E}_\mu \bigl[ (\nabla_y f)^T (\nabla_y f) \bigr] = \lambda_1 + \dots + \lambda_k , \]
\[ \mathbb{E}_\mu \bigl[ (\nabla_z f)^T (\nabla_z f) \bigr] = \lambda_{k+1} + \dots + \lambda_n . \]

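Proof. By the chain rule, the gradient of $f(x) = f(W_1 y + W_2 z)$ with respect
to $y$ is given by

\[ \nabla_y f(x) = \nabla_y f(W_1 y + W_2 z) = W_1^T \nabla_x f(W_1 y + W_2 z) = W_1^T \nabla_x f(x) . \]

Thus,

\begin{align*}
\mathbb{E}_\mu \bigl[ (\nabla_y f)^T (\nabla_y f) \bigr]
&= \mathbb{E}_\mu \bigl[ \operatorname{tr} \bigl( (\nabla_y f)(\nabla_y f)^T \bigr) \bigr] \\
&= \operatorname{tr} \mathbb{E}_\mu \bigl[ (\nabla_y f)(\nabla_y f)^T \bigr] \\
&= \operatorname{tr} \bigl( W_1^T \mathbb{E}_\mu \bigl[ (\nabla_x f)(\nabla_x f)^T \bigr] W_1 \bigr) \\
&= \operatorname{tr} ( W_1^T C W_1 ) \\
&= \operatorname{tr} \Lambda_1 \\
&= \lambda_1 + \dots + \lambda_k .
\end{align*}

This proves the claim for the active coordinates $y \in \mathbb{R}^k$; the proof for the
inactive coordinates $z \in \mathbb{R}^{n-k}$ is similar. □

Proposition 10.17 implies that a function for which $\lambda_{k+1} = \dots = \lambda_n = 0$
has $\nabla_z f = 0$ $\mu$-almost everywhere in $\mathcal{X}$. Unsurprisingly, for such functions,
the value of $f$ depends only on the active variable $y$ and not upon the inactive
variable $z$: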
Proposition 10.18. Suppose that $\mu$ is absolutely continuous with respect to
Lebesgue measure on $\mathcal{X}$, and suppose that $f \colon \mathcal{X} \to \mathbb{R}$ is such that $\lambda_{k+1} =
\dots = \lambda_n = 0$. Then, whenever $x_1, x_2 \in \mathcal{X}$ have equal active component, i.e.
$W_1^T x_1 = W_1^T x_2$, it follows that $f(x_1) = f(x_2)$ and $\nabla_x f(x_1) = \nabla_x f(x_2)$.

Proof. The gradient $\nabla_z f$ being zero everywhere in $\mathcal{X}$ implies that $f(x_1) =
f(x_2)$. To show that the gradients are equal, assume that $x_1$ and $x_2$ lie in the
interior of $\mathcal{X}$. Then for any $v \in \mathbb{R}^n$, let

\[ x_1' = x_1 + h v , \quad \text{and} \quad x_2' = x_2 + h v , \]

where $h \in \mathbb{R}$ is small enough that $x_1'$ and $x_2'$ lie in the interior of $\mathcal{X}$. Note
that $W_1^T x_1' = W_1^T x_2'$, and so $f(x_1') = f(x_2')$. Then

\[ v \cdot ( \nabla_x f(x_1) - \nabla_x f(x_2) ) = \lim_{h \to 0} \frac{ ( f(x_1') - f(x_1) ) - ( f(x_2') - f(x_2) ) }{h} = 0 . \]

Simple limiting arguments can be used to extend this result to $x_1$ or $x_2 \in \partial \mathcal{X}$.
Since $v \in \mathbb{R}^n$ was arbitrary, it follows that $\nabla_x f(x_1) = \nabla_x f(x_2)$. □

Example 10.19. In some cases, the active subspace can be identified exactly
from the form of the function $f$:

(a) Suppose that $f$ is a ridge function, i.e. a function of the form $f(x) := h(a \cdot x)$,
where $h \colon \mathbb{R} \to \mathbb{R}$ and $a \in \mathbb{R}^n$. In this case, $C$ has rank one, and
the eigenvector defining the active subspace is $w_1 = a / \|a\|$, which can be
discovered by a single evaluation of the gradient anywhere in $\mathcal{X}$.

(b) Consider $f(x) := h(x \cdot A x)$, where $h \colon \mathbb{R} \to \mathbb{R}$ and $A \in \mathbb{R}^{n \times n}$ is symmetric.
In this case,

\[ C = 4 A\, \mathbb{E}\bigl[ (h')^2 x x^T \bigr] A^T, \]

where $h' = h'(x \cdot A x)$ is the derivative of $h$. Provided $h'$ is non-degenerate,
$\ker C = \ker A$.
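
Both parts of Example 10.19 follow from a direct gradient computation. The following is a sketch only, assuming that $h$ is differentiable and that the expectations (taken with respect to $\mu$) are finite:

```latex
% (a) Ridge function f(x) = h(a \cdot x):
\nabla f(x) = h'(a \cdot x)\, a,
\qquad
C = \mathbb{E}_\mu\bigl[ (\nabla f)(\nabla f)^{T} \bigr]
  = \mathbb{E}_\mu\bigl[ h'(a \cdot x)^2 \bigr]\, a a^{T}.

% (b) f(x) = h(x \cdot A x) with A symmetric:
\nabla f(x) = 2\, h'(x \cdot A x)\, A x,
\qquad
C = 4 A\, \mathbb{E}_\mu\bigl[ h'(x \cdot A x)^2\, x x^{T} \bigr] A^{T}.
```

In (a), $C$ is a non-negative multiple of $a a^T$, so (provided $\mathbb{E}_\mu[h'(a \cdot x)^2] > 0$) its only non-zero eigenvalue is $\|a\|^2\, \mathbb{E}_\mu[h'(a \cdot x)^2]$, with eigenvector $a / \|a\|$.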

Numerical Approximation of Active Subspaces. When the expected
value used to define the matrix $C$ and hence the active subspace decomposition
is approximated using Monte Carlo sampling, the active subspace
method has a nice connection to the singular value decomposition (SVD).
That is, suppose that $x^{(1)}, \dots, x^{(M)}$ are $M$ independent draws from the
probability measure $\mu$. The corresponding Monte Carlo approximation to $C$ is

\[ C \approx \widehat{C} := \frac{1}{M} \sum_{m=1}^{M} \nabla f(x^{(m)})\, \nabla f(x^{(m)})^T. \]

The eigendecomposition of $\widehat{C}$ as $\widehat{C} = \widehat{W} \widehat{\Lambda} \widehat{W}^T$ can be computed as before.
However, if

\[ G := \frac{1}{\sqrt{M}} \begin{bmatrix} \nabla f(x^{(1)}) & \cdots & \nabla f(x^{(M)}) \end{bmatrix} \in \mathbb{R}^{n \times M}, \]

then $\widehat{C} = G G^T$, and an SVD of $G$ is given by $G = \widehat{W} \widehat{\Lambda}^{1/2} V^T$ for some
orthogonal matrix $V$. In practice, the eigenpairs $\widehat{W}$ and $\widehat{\Lambda}$ from the
finite-sample approximation $\widehat{C}$ are used as approximations of the true eigenpairs
$W$ and $\Lambda$ of $C$.
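
As an illustration of the Monte Carlo/SVD construction above, the following Python sketch assembles the scaled gradient matrix and reads off approximate eigenpairs from its SVD. The function name `estimate_active_subspace`, the ridge test function $f(x) = \sin(a \cdot x)$, the vector `a` and the uniform sampling measure are all assumptions made for the example, not part of the text.

```python
import numpy as np

def estimate_active_subspace(grad_f, samples):
    """Estimate eigenpairs of C from gradient samples via an SVD of G.

    grad_f  : callable mapping a point x (shape (n,)) to the gradient of f at x
    samples : array of shape (M, n) holding the draws x^(1), ..., x^(M)
    Returns (W_hat, lam_hat): orthonormal directions (as columns) and the
    eigenvalues of C_hat = G G^T, in decreasing order.
    """
    M = samples.shape[0]
    # Columns of G are the gradient samples scaled by 1/sqrt(M), so C_hat = G @ G.T.
    G = np.column_stack([grad_f(x) for x in samples]) / np.sqrt(M)
    W_hat, sing_vals, _ = np.linalg.svd(G, full_matrices=False)
    return W_hat, sing_vals**2

# Hypothetical ridge function f(x) = sin(a . x), whose gradient is cos(a . x) a.
a = np.array([3.0, -1.0, 2.0, 0.5])

def grad_f(x):
    return np.cos(a @ x) * a

rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=(200, a.size))
W_hat, lam_hat = estimate_active_subspace(grad_f, samples)
w1 = W_hat[:, 0]   # leading direction; for a ridge function this is +/- a / ||a||
```

For this rank-one example, all eigenvalues after the first vanish up to rounding error, in line with Example 10.19(a).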

The SVD approach is more numerically stable than an eigendecomposition, 
and is also used in the technique of principal component analysis (PCA). 
However, PCA applies the SVD to the rectangular matrix whose columns 
are samples of a vector-valued response function, and posits a linear model 
for the data; the active subspace method applies the SVD to the rectangular 
matrix whose columns are the gradient vectors of a scalar-valued response 
function, and makes no linearity assumption about the model. 

Example 10.20. Consider the Van der Pol oscillator

\[ \ddot{u}(t) - \mu \bigl( 1 - u(t)^2 \bigr) \dot{u}(t) + \omega^2 u(t) = 0, \]

with the initial conditions $u(0) = 1$, $\dot{u}(0) = 0$. Suppose that we are interested
in the state of the oscillator at time $T := 2\pi$; if $\omega = 1$ and $\mu = 0$, then
$u(T) = u(0) = 1$. Now suppose that $\omega \sim \mathrm{Unif}([0.8, 1.2])$ and $\mu \sim \mathrm{Unif}([0, 5])$;
a contour plot of $u(T)$ as a function of $\omega$ and $\mu$ is shown in Figure 10.1(a).

Sampling the gradient of $u(T)$ with respect to the normalized coordinates

\[ x_1 := 2\, \frac{\omega - 0.8}{1.2 - 0.8} - 1 \in [-1, 1], \qquad x_2 := 2\, \frac{\mu}{5} - 1 \in [-1, 1] \]

gives an approximate covariance matrix

\[ \mathbb{E}\bigl[ \nabla_x u(T) (\nabla_x u(T))^T \bigr] \approx \begin{bmatrix} 1.776 & -1.389 \\ -1.389 & 1.672 \end{bmatrix}, \]

which has the eigendecomposition $W \Lambda W^T$ with

\[ \Lambda = \begin{bmatrix} 3.115 & 0 \\ 0 & 0.3339 \end{bmatrix} \quad \text{and} \quad W = \begin{bmatrix} 0.7202 & 0.6938 \\ -0.6938 & 0.7202 \end{bmatrix}. \]

Thus, at least over this range of the $\omega$ and $\mu$ parameters, this system has
an active subspace in the direction $w_1 = (0.7202, -0.6938)$ in the normalized
$x$-coordinate system. In the original $(\omega, \mu)$-coordinate system, this active subspace
lies in the $(0.144, -1.735)$ direction.

Fig. 10.1: Illustration of Example 10.20. Subfigure (a) shows contours of the
state at time $T = 2\pi$ of a Van der Pol oscillator with initial state 1.0 and
velocity 0.0, as a function of natural frequency $\omega$ and damping $\mu$. This system
has an active subspace in the $(0.144, -1.735)$ direction; roughly speaking,
‘most’ of the contours are perpendicular to this direction. Subfigure (b) shows
a projection onto this direction of 1000 samples of $u(T)$, with uniformly
distributed $\omega$ and $\mu$, in the style of Exercise 10.9; this further illustrates the
almost one-dimensional nature of the system response.
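
The linear algebra in Example 10.20 can be checked with a few lines of Python. This sketch starts from the rounded $2 \times 2$ matrix quoted in the example (it does not re-solve the oscillator), so the computed values agree with the quoted ones only to about three significant figures; the variable names are illustrative.

```python
import numpy as np

# Approximate covariance matrix of the gradient, as quoted (rounded) in the example.
C_hat = np.array([[1.776, -1.389],
                  [-1.389, 1.672]])

# eigh returns eigenvalues in increasing order; reverse to get decreasing order.
eigvals, eigvecs = np.linalg.eigh(C_hat)
lam = eigvals[::-1]
W = eigvecs[:, ::-1]

# Eigenvectors are defined up to sign; make the first entry of w1 positive.
w1 = W[:, 0] * np.sign(W[0, 0])

# x1 = 2 (omega - 0.8) / (1.2 - 0.8) - 1 and x2 = 2 mu / 5 - 1, so a displacement
# (d_omega, d_mu) maps to (5 d_omega, 0.4 d_mu). Undo this scaling to express
# the direction w1 in the original (omega, mu) coordinates.
scale = np.array([2.0 / (1.2 - 0.8), 2.0 / 5.0])
direction_original = w1 / scale

print("eigenvalues:", lam)
print("w1 (normalized coordinates):", w1)
print("direction in (omega, mu):", direction_original)
```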

Applications of Active Subspaces. The main motivation for determining
an active subspace for $f \colon \mathcal{X} \to \mathbb{R}$ is to then approximate $f$ by a function $F$
of the active variables alone, i.e.

\[ f(x) = f(W_1 y + W_2 z) \approx F(W_1 y). \]

Given such an approximation, $F \circ W_1$ can be used as a proxy for $f$ for the
purposes of optimization, optimal control, forward and inverse uncertainty
propagation, and so forth.

10.7 Exercises

Exercise 10.1. Consider a power series $f(x) := \sum_{n \in \mathbb{N}_0} a_n x^n$, thought of as
a function $f \colon \mathbb{R} \to \mathbb{R}$, with radius of convergence $R$. Show that the extension
$f \colon \mathbb{R}_\epsilon \to \mathbb{R}_\epsilon$ of $f$ to the dual numbers satisfies

\[ f(x + \epsilon) = f(x) + f'(x) \epsilon \]

whenever $|x| < R$. Hence show that, if $g \colon \mathbb{R} \to \mathbb{R}$ is an analytic function, then
$g'(x)$ is the coefficient of $\epsilon$ in $g(x + \epsilon)$.

Exercise 10.2. An example partial implementation of dual numbers in
Python is as follows:

    class DualNumber(object):
        def __init__(self, r, e):
            # Initialization of real and infinitesimal parts.
            self.r = r
            self.e = e

        def __repr__(self):
            # How to print dual numbers.
            return str(self.r) + " + " + str(self.e) + " * e"

        def __add__(self, other):
            # Overload the addition operator to allow addition of
            # dual numbers.
            if not isinstance(other, DualNumber):
                new_other = DualNumber(other, 0)
            else:
                new_other = other
            r_part = self.r + new_other.r
            e_part = self.e + new_other.e
            return DualNumber(r_part, e_part)

Following the template of the overloaded addition operator, write analogous
methods def __sub__(self, other), def __mul__(self, other),
and def __truediv__(self, other) (named __div__ in Python 2) for this
DualNumber class to overload the subtraction, multiplication and division
operators. The result will be that any numerical function you have written
using the standard arithmetic operations +, -, *, and / will now accept
DualNumber arguments and return DualNumber values in accordance with the
rules of dual number arithmetic.

Once you have done this, the following function will accept a function f
as its argument and return a new function f_prime that is the derivative of
f, calculated using automatic differentiation:

    def AutomaticDerivative(f):
        # Accepts a function f as an argument and returns a new
        # function that is the derivative of f, calculated using
        # automatic differentiation.
        def f_prime(x):
            f_x_plus_eps = f(DualNumber(x, 1))
            deriv = f_x_plus_eps.e
            return deriv
        return f_prime

Test this function using several functions of your choice, and verify that it
correctly calculates the derivative of a product (the Leibniz rule), a quotient
and a composition (the chain rule).

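Exercise 10.3. Let $f \colon \mathbb{R}^n \to \mathbb{R}^m$ be a polynomial or convergent power series

\[ f(x) = \sum_{\alpha} c_\alpha x^\alpha \]

in $x = (x_1, \dots, x_n)$, where $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{N}_0^n$ are multi-indices, $c_\alpha \in \mathbb{R}^m$,
and $x^\alpha := x_1^{\alpha_1} \cdots x_n^{\alpha_n}$. Consider the dual vectors over $\mathbb{R}^n$ obtained by adjoining
a vector element $\epsilon = (\epsilon_1, \dots, \epsilon_n)$ such that $\epsilon_i \epsilon_j = 0$ for all $i, j \in \{1, \dots, n\}$.
Show that

\[ f(x + \epsilon) = f(x) + \sum_{\alpha} c_\alpha \sum_{i=1}^{n} \alpha_i x^{\alpha - e_i} \epsilon_i, \]

where $e_i$ denotes the $i$-th standard unit multi-index, and hence that
$\frac{\partial f}{\partial x_i}(x)$ is the coefficient of $\epsilon_i$ in $f(x + \epsilon)$.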

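Exercise 10.4. Consider an ODE of the form $\dot{u}(t) = f(u(t); \theta)$ for an unknown
$u(t) \in \mathbb{R}$, where $\theta \in \mathbb{R}$ is a vector of parameters, and $f \colon \mathbb{R}^2 \to \mathbb{R}$ is
a smooth vector field. Define the local sensitivity of the solution $u$ about a
nominal parameter value $\theta^* \in \mathbb{R}$ to be the partial derivative $s := \frac{\partial u}{\partial \theta}(\theta^*)$.
Show that this sensitivity index $s$ evolves according to the adjoint equation

\[ \dot{s}(t) = \frac{\partial f}{\partial u}\bigl( u(t; \theta^*); \theta^* \bigr) s(t) + \frac{\partial f}{\partial \theta}\bigl( u(t; \theta^*); \theta^* \bigr). \]

Extend this result to a vector-valued unknown $u(t)$, and vector of parameters
$\theta = (\theta_1, \dots, \theta_p)$.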

Exercise 10.5. Show that, for each $j = 1, \dots, n$, the McDiarmid subdiameter
$\mathcal{D}_j[\,\cdot\,]$ is a seminorm on the space of bounded functions $f \colon \mathcal{X} \to \mathbb{K}$, as is
the McDiarmid diameter $\mathcal{D}[\,\cdot\,]$. What are the null-spaces of these seminorms?
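
Exercise 10.6. Define, for constants $a, b, c, d \in \mathbb{R}$, $f \colon [0, 1]^2 \to \mathbb{R}$ by

\[ f(x_1, x_2) := a + b x_1 + c x_2 + d x_1 x_2. \]

Show that the ANOVA decomposition of $f$ (with respect to uniform measure
on the square) is

\begin{align*}
f_{\varnothing} &= a + \tfrac{b}{2} + \tfrac{c}{2} + \tfrac{d}{4}, \\
f_{\{1\}}(x_1) &= \bigl( b + \tfrac{d}{2} \bigr) \bigl( x_1 - \tfrac{1}{2} \bigr), \\
f_{\{2\}}(x_2) &= \bigl( c + \tfrac{d}{2} \bigr) \bigl( x_2 - \tfrac{1}{2} \bigr), \\
f_{\{1,2\}}(x_1, x_2) &= d \bigl( x_1 - \tfrac{1}{2} \bigr) \bigl( x_2 - \tfrac{1}{2} \bigr).
\end{align*}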


Exercise 10.7. Let $f \colon [-1, 1]^2 \to \mathbb{R}$ be a function of two variables. Sketch
the vanishing sets of the component functions of $f$ in a Cut-HDMR expansion
through $\bar{x} = (0, 0)$. Do the same exercise for $f \colon [-1, 1]^3 \to \mathbb{R}$ and $\bar{x} = (0, 0, 0)$,
taking particular care with second-order terms like $f_{\{1,2\}}$.

Exercise 10.8. For a function $f \colon [0, 1]^n \to \mathbb{R}$ with variance $\sigma^2$, suppose that
the input variables of $f$ have been ordered according to their importance in
the sense that $\sigma^2_{\{1\}} \ge \sigma^2_{\{2\}} \ge \dots \ge \sigma^2_{\{n\}} \ge 0$. The truncation dimension of $f$
with proportion $\alpha \in [0, 1]$ is defined to be the least $d_t = d_t(\alpha) \in \{1, \dots, n\}$
such that

\[ \sum_{\varnothing \ne I \subseteq \{1, \dots, d_t\}} \sigma_I^2 \ge \alpha \sigma^2, \]

i.e. the first $d_t$ inputs explain a proportion $\alpha$ of the variance of $f$. Show that

\[ f_{d_t}(x) := \sum_{I \subseteq \{1, \dots, d_t\}} f_I(x_I) \]

is an approximation to $f$ with error $\| f - f_{d_t} \|_{L^2}^2 \le (1 - \alpha) \sigma^2$. Formulate
and prove a similar result for the superposition dimension $d_s$, the least
$d_s = d_s(\alpha) \in \{1, \dots, n\}$ such that

\[ \sum_{\substack{\varnothing \ne I \subseteq \{1, \dots, n\} \\ \# I \le d_s}} \sigma_I^2 \ge \alpha \sigma^2. \]

Exercise 10.9. Building upon the notion of a sufficient summary plot
developed by Cook (1998), Constantine (2015, Section 1.3) offers the following
“quick and dirty” check for a one-dimensional active subspace for
$f \colon [-1, 1]^n \to \mathbb{R}$ that can be evaluated only a limited number of times
(say, $M$ times) with the available resources:

(a) Draw $M$ samples $x_1, \dots, x_M \in [-1, 1]^n$ according to some probability
distribution on the cube, e.g. uniform measure.

(b) Evaluate $f(x_m)$ for $m = 1, \dots, M$.

(c) Find $(a_0, a_1, \dots, a_n) \in \mathbb{R}^{1+n}$ to minimize

\[ J(a) := \left\| \begin{bmatrix} 1 & x_1^T \\ \vdots & \vdots \\ 1 & x_M^T \end{bmatrix} \begin{bmatrix} a_0 \\ \vdots \\ a_n \end{bmatrix} - \begin{bmatrix} f(x_1) \\ \vdots \\ f(x_M) \end{bmatrix} \right\|_2^2. \]

Note that this step can be interpreted as forming a linear
statistical regression model.

(d) Let $a' := (a_1, \dots, a_n)$, and define a unit vector $w \in \mathbb{R}^n$ by $w := a' / \|a'\|_2$.

(e) Produce a scatter plot of the points $(w \cdot x_m, f(x_m))$ for $m = 1, \dots, M$.
If this scatter plot looks like the graph of a single-valued function, then
this is a good indication that $f$ has a one-dimensional active subspace in
the $w$ direction.

One interpretation of this procedure is that it looks for a rotation of the
domain $[-1, 1]^n$ such that, in this rotated frame of reference, the graph of $f$
looks ‘almost’ like a curve, though it is not necessary that $f$ be a linear
function of $w \cdot x$. Examine your favourite model $f$ for a one-dimensional active
subspace in this way.
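
Steps (a) to (d) of the procedure in Exercise 10.9 can be sketched in Python as follows. The function name `summary_direction` and the test model (a quadratic function of $a \cdot x$ with a hypothetical vector `a`) are assumptions made purely for illustration; step (e) then amounts to plotting `projected` against `y` with any plotting library.

```python
import numpy as np

def summary_direction(f, M, n, rng):
    """Steps (a)-(d): sample, evaluate, fit a linear model, normalize its slope."""
    X = rng.uniform(-1.0, 1.0, size=(M, n))          # (a) uniform samples on the cube
    y = np.array([f(x) for x in X])                  # (b) model evaluations
    A = np.column_stack([np.ones(M), X])             # rows [1, x_m^T] of the design matrix
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)   # (c) least-squares fit of (a_0, ..., a_n)
    a_prime = coeffs[1:]                             # (d) discard the intercept a_0
    w = a_prime / np.linalg.norm(a_prime)
    return w, X, y

# Hypothetical test model: depends on x only through t = a . x, but not linearly.
a = np.array([1.0, -2.0, 0.5])

def f(x):
    t = a @ x
    return t + 0.3 * t**2

rng = np.random.default_rng(1)
w, X, y = summary_direction(f, M=500, n=3, rng=rng)
projected = X @ w    # abscissae w . x_m for the scatter plot of step (e)
```

For this test model the fitted direction `w` comes out close to $a / \|a\|$, even though $f$ is not a linear function of $w \cdot x$.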


Chapter 11 

Spectral Expansions 


The mark of a mature, psychologically 
healthy mind is indeed the ability to live with 
uncertainty and ambiguity, but only as much 
as there really is. 


Julian Baggini 


This chapter and its sequels consider several spectral methods for uncertainty
quantification. At their core, these are orthogonal decomposition
methods in which a random variable or stochastic process (usually the solution
of interest) over a probability space $(\Theta, \mathscr{F}, \mu)$ is expanded with respect to an
appropriate orthogonal basis of $L^2(\Theta, \mu; \mathbb{R})$. This chapter lays the foundations
by considering spectral expansions in general, starting with the Karhunen-Loève
bi-orthogonal decomposition, and continuing with orthogonal polynomial
bases for $L^2(\Theta, \mu; \mathbb{R})$ and the resulting polynomial chaos decompositions.
Chapters 12 and 13 will then treat two classes of methods for the determination
of coefficients in spectral expansions, the intrusive and non-intrusive
approaches respectively.


11.1 Karhunen-Loève Expansions

Fix a domain $\mathcal{X} \subseteq \mathbb{R}^d$ (which could be thought of as ‘space’, ‘time’ or a
general parameter space) and a probability space $(\Theta, \mathscr{F}, \mu)$. The Karhunen-Loève
expansion of a square-integrable stochastic process $U \colon \mathcal{X} \times \Theta \to \mathbb{R}$
is a particularly nice spectral decomposition, in that it decomposes $U$ in a
bi-orthogonal fashion, i.e. in terms of components that are both orthogonal
over the spatio-temporal domain $\mathcal{X}$ and the probability space $\Theta$.


(c) Springer International Publishing Switzerland 2015 

T.J. Sullivan, Introduction to Uncertainty Quantification , Texts 

in Applied Mathematics 63, DOI 10.1007/978-3-319-23395-6-11 

To be more precise, consider a stochastic process $U \colon \mathcal{X} \times \Theta \to \mathbb{R}$ such that

• for all $x \in \mathcal{X}$, $U(x) \in L^2(\Theta, \mu; \mathbb{R})$;

• for all $x \in \mathcal{X}$, $\mathbb{E}_\mu[U(x)] = 0$;

• the covariance function $C_U(x, y) := \mathbb{E}_\mu[U(x) U(y)]$ is a well-defined
continuous function of $x, y \in \mathcal{X}$.

Remark 11.1. (a) The condition that $U$ is a zero-mean process is not a
serious restriction; if $U$ is not a zero-mean process, then simply consider
$\tilde{U}$ defined by $\tilde{U}(x, \theta) := U(x, \theta) - \mathbb{E}_\mu[U(x)]$.

(b) It is common in practice to see the covariance function interpreted as
providing some information on the correlation length of the process $U$.
That is, $C_U(x, y)$ depends only upon $\|x - y\|$ and, for some function
$g \colon [0, \infty) \to [0, \infty)$, $C_U(x, y) = g(\|x - y\|)$. A typical such $g$ is
$g(r) = \exp(-r / r_0)$, and the constant $r_0$ encodes how similar values of $U$ at
nearby points of $\mathcal{X}$ are expected to be; when the correlation length $r_0$
is small, the field $U$ has dissimilar values near to one another, and so is
rough; when $r_0$ is large, the field $U$ has only similar values near to one
another, and so is more smooth.

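By abuse of notation, $C_U$ will also denote the covariance operator of $U$,
which is the linear operator $C_U \colon L^2(\mathcal{X}, dx; \mathbb{R}) \to L^2(\mathcal{X}, dx; \mathbb{R})$ defined by

\[ (C_U f)(x) := \int_{\mathcal{X}} C_U(x, y) f(y) \, dy. \]

Now let $\{ \psi_n \mid n \in \mathbb{N} \}$ be an orthonormal basis of $L^2(\mathcal{X}, dx; \mathbb{R})$ consisting of
eigenvectors of $C_U$, with corresponding eigenvalues $\{ \lambda_n \mid n \in \mathbb{N} \}$, i.e.

\[ \int_{\mathcal{X}} C_U(x, y) \psi_n(y) \, dy = \lambda_n \psi_n(x), \qquad \int_{\mathcal{X}} \psi_m(x) \psi_n(x) \, dx = \delta_{mn}. \]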

Definition 11.2. Let $\mathcal{X}$ be a first-countable topological space. A function
$K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a Mercer kernel if

(a) $K$ is continuous;

(b) $K$ is symmetric, i.e. $K(x, x') = K(x', x)$ for all $x, x' \in \mathcal{X}$; and

(c) $K$ is positive semi-definite in the sense that, for all choices of finitely
many points $x_1, \dots, x_n \in \mathcal{X}$, the Gram matrix

\[ G := \begin{bmatrix} K(x_1, x_1) & \cdots & K(x_1, x_n) \\ \vdots & \ddots & \vdots \\ K(x_n, x_1) & \cdots & K(x_n, x_n) \end{bmatrix} \]

is positive semi-definite, i.e. satisfies $\xi \cdot G \xi \ge 0$ for all $\xi \in \mathbb{R}^n$.

Theorem 11.3 (Mercer). Let $\mathcal{X}$ be a first-countable topological space equipped
with a complete Borel measure $\mu$. Let $K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel. If
$x \mapsto K(x, x)$ lies in $L^1(\mathcal{X}, \mu; \mathbb{R})$, then there is an orthonormal basis $\{ \psi_n \}_{n \in \mathbb{N}}$
of $L^2(\mathcal{X}, \mu; \mathbb{R})$ consisting of eigenfunctions of the operator

\[ f \mapsto \int_{\mathcal{X}} K(\,\cdot\,, y) f(y) \, d\mu(y) \]

with non-negative eigenvalues $\{ \lambda_n \}_{n \in \mathbb{N}}$. Furthermore, the eigenfunctions
corresponding to non-zero eigenvalues are continuous, and

\[ K(x, y) = \sum_{n \in \mathbb{N}} \lambda_n \psi_n(x) \psi_n(y), \]

and this series converges absolutely, uniformly over compact subsets of $\mathcal{X}$.

The proof of Mercer’s theorem will be omitted, since the main use of the 
theorem is just to inform various statements about the eigendecomposition 
of the covariance operator in the Karhunen-Loeve theorem. However, it is 
worth comparing the conditions of Mercer’s theorem to those of Sazonov’s 
theorem (Theorem 2.49): together, these two theorems show which integral 
kernels can be associated with covariance operators of Gaussian measures. 

Theorem 11.4 (Karhunen-Loève). Let $U \colon \mathcal{X} \times \Theta \to \mathbb{R}$ be a square-integrable
stochastic process, with mean zero and continuous and square-integrable¹
covariance function. Then

\[ U = \sum_{n \in \mathbb{N}} Z_n \psi_n \]

in $L^2$, where the $\{ \psi_n \}_{n \in \mathbb{N}}$ are orthonormal eigenfunctions of the covariance
operator $C_U$, the corresponding eigenvalues $\{ \lambda_n \}_{n \in \mathbb{N}}$ are non-negative, the
convergence of the series is in $L^2(\Theta, \mu; \mathbb{R})$ and uniform among compact families
of $x \in \mathcal{X}$, with

\[ Z_n = \int_{\mathcal{X}} U(x) \psi_n(x) \, dx. \]

Furthermore, the random variables $Z_n$ are centred, uncorrelated, and have
variance $\lambda_n$:

\[ \mathbb{E}_\mu[Z_n] = 0 \quad \text{and} \quad \mathbb{E}_\mu[Z_m Z_n] = \lambda_n \delta_{mn}. \]

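Proof. By Exercise 2.1, and since the covariance function $C_U$ is continuous
and square-integrable on $\mathcal{X} \times \mathcal{X}$, it is integrable on the diagonal, and hence
is a Mercer kernel. So, by Mercer’s theorem, there is an orthonormal basis
$\{ \psi_n \}_{n \in \mathbb{N}}$ of $L^2(\mathcal{X}, dx; \mathbb{R})$ consisting of eigenfunctions of the covariance
operator with non-negative eigenvalues $\{ \lambda_n \}_{n \in \mathbb{N}}$. In this basis, the covariance
function has the representation

\[ C_U(x, y) = \sum_{n \in \mathbb{N}} \lambda_n \psi_n(x) \psi_n(y). \]

Write the process $U$ in terms of this basis as

\[ U(x, \theta) = \sum_{n \in \mathbb{N}} Z_n(\theta) \psi_n(x), \]

where the coefficients $Z_n = Z_n(\theta)$ are given by orthogonal projection:

\[ Z_n(\theta) := \int_{\mathcal{X}} U(x, \theta) \psi_n(x) \, dx. \]

(Note that these coefficients $Z_n$ are real-valued random variables.) Then

\[ \mathbb{E}_\mu[Z_n] = \mathbb{E}_\mu\left[ \int_{\mathcal{X}} U(x) \psi_n(x) \, dx \right] = \int_{\mathcal{X}} \mathbb{E}_\mu[U(x)] \psi_n(x) \, dx = 0 \]

and

\begin{align*}
\mathbb{E}_\mu[Z_m Z_n]
  &= \mathbb{E}_\mu\left[ \int_{\mathcal{X}} U(x) \psi_m(x) \, dx \int_{\mathcal{X}} U(x) \psi_n(x) \, dx \right] \\
  &= \mathbb{E}_\mu\left[ \int_{\mathcal{X}} \int_{\mathcal{X}} \psi_m(x) U(x) U(y) \psi_n(y) \, dy \, dx \right] \\
  &= \int_{\mathcal{X}} \psi_m(x) \int_{\mathcal{X}} \mathbb{E}_\mu[U(x) U(y)] \psi_n(y) \, dy \, dx \\
  &= \int_{\mathcal{X}} \psi_m(x) \int_{\mathcal{X}} C_U(x, y) \psi_n(y) \, dy \, dx \\
  &= \int_{\mathcal{X}} \psi_m(x) \lambda_n \psi_n(x) \, dx \\
  &= \lambda_n \delta_{mn}.
\end{align*}

Let $S_N := \sum_{n=1}^{N} Z_n \psi_n \colon \mathcal{X} \times \Theta \to \mathbb{R}$. Then, for any $x \in \mathcal{X}$,

\begin{align*}
\mathbb{E}_\mu\bigl[ |U(x) - S_N(x)|^2 \bigr]
  &= \mathbb{E}_\mu[U(x)^2] + \mathbb{E}_\mu[S_N(x)^2] - 2 \mathbb{E}_\mu[U(x) S_N(x)] \\
  &= C_U(x, x) + \mathbb{E}_\mu\left[ \sum_{n=1}^{N} \sum_{m=1}^{N} Z_n Z_m \psi_m(x) \psi_n(x) \right]
     - 2 \mathbb{E}_\mu\left[ U(x) \sum_{n=1}^{N} Z_n \psi_n(x) \right] \\
  &= C_U(x, x) + \sum_{n=1}^{N} \lambda_n \psi_n(x)^2
     - 2 \mathbb{E}_\mu\left[ \sum_{n=1}^{N} \int_{\mathcal{X}} U(x) U(y) \psi_n(y) \psi_n(x) \, dy \right] \\
  &= C_U(x, x) + \sum_{n=1}^{N} \lambda_n \psi_n(x)^2
     - 2 \sum_{n=1}^{N} \int_{\mathcal{X}} C_U(x, y) \psi_n(y) \psi_n(x) \, dy \\
  &= C_U(x, x) - \sum_{n=1}^{N} \lambda_n \psi_n(x)^2 \\
  &\to 0 \quad \text{as } N \to \infty,
\end{align*}

where the convergence with respect to $x$, uniformly over compact subsets of
$\mathcal{X}$, follows from Mercer’s theorem. □

¹ In the case that $\mathcal{X}$ is compact, it is enough to assume that the covariance function is
continuous, from which it follows that it is bounded and hence square-integrable on $\mathcal{X} \times \mathcal{X}$.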

Among many possible decompositions of a random field, the Karhunen- 
Loeve expansion is optimal in the sense that the mean-square error of any 
truncation of the expansion after finitely many terms is minimal. However, its 
utility is limited since the covariance function of the solution process is often 
not known a priori. Nevertheless, the Karhunen-Loeve expansion provides an 
effective means of representing input random processes when their covariance 
structure is known, and provides a simple method for sampling Gaussian 
measures on Hilbert spaces, which is a necessary step in the implementation 
of the methods outlined in Chapter 6. 
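
When the covariance function is known, the eigenpairs in the Karhunen-Loève theorem can be approximated numerically. The sketch below is one simple possibility, not a method prescribed by the text: it discretizes the covariance operator on an interval with the midpoint rule, using the exponential covariance of Remark 11.1(b). The function name `discrete_kl`, the grid size and the value of `r0` are illustrative assumptions.

```python
import numpy as np

def discrete_kl(cov, grid_size, domain=(0.0, 1.0)):
    """Approximate Karhunen-Loeve eigenpairs of a covariance function on an interval.

    The midpoint rule with equal weights h replaces the covariance operator
    by the symmetric matrix h * C, where C[i, j] = cov(x_i, x_j).
    """
    lo, hi = domain
    h = (hi - lo) / grid_size
    x = lo + h * (np.arange(grid_size) + 0.5)     # midpoints of the grid cells
    C = cov(x[:, None], x[None, :])
    eigvals, eigvecs = np.linalg.eigh(h * C)
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues in decreasing order
    lam = eigvals[order]
    psi = eigvecs[:, order] / np.sqrt(h)          # columns have unit discrete L^2 norm
    return x, h, C, lam, psi

r0 = 0.2   # illustrative correlation length

def cov(x, y):
    return np.exp(-np.abs(x - y) / r0)

x, h, C, lam, psi = discrete_kl(cov, grid_size=200)

# Pointwise mean-square error of the N-term truncation,
# C_U(x, x) - sum_{n <= N} lam_n psi_n(x)^2, as in the proof of Theorem 11.4.
N = 20
trunc_err = np.diag(C) - (psi[:, :N] ** 2) @ lam[:N]
```

In this discrete setting the integrated truncation error `h * trunc_err.sum()` equals the sum of the discarded eigenvalues `lam[N:]`, which is one way to choose $N$ for a given accuracy.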

Example 11.5. Suppose that $C \colon \mathcal{H} \to \mathcal{H}$ is a self-adjoint, positive-definite,
nuclear operator on a Hilbert space $\mathcal{H}$ and let $m \in \mathcal{H}$. Let $(\lambda_k, \psi_k)_{k \in \mathbb{N}}$ be a
sequence of orthonormal eigenpairs for $C$, ordered by decreasing eigenvalue
$\lambda_k$. Let $\xi_1, \xi_2, \dots$ be independently distributed according to the standard
Gaussian measure $\mathcal{N}(0, 1)$ on $\mathbb{R}$. Then, by the Karhunen-Loève theorem,

\[ U := m + \sum_{k=1}^{\infty} \lambda_k^{1/2} \xi_k \psi_k \tag{11.1} \]

is an $\mathcal{H}$-valued random variable with distribution $\mathcal{N}(m, C)$. Therefore, a finite
sum of the form $m + \sum_{k=1}^{K} \lambda_k^{1/2} \xi_k \psi_k$ for large $K$ is a reasonable approximation
to a $\mathcal{N}(m, C)$-distributed random variable; this is the procedure used to
generate the sample paths in Figure 11.1.

Note that the real-valued random variable $\lambda_k^{1/2} \xi_k$ has Lebesgue density
proportional to $\exp(-|\xi_k|^2 / 2\lambda_k)$. Therefore, although Theorem 2.38 shows
that the infinite product of Lebesgue measures on $\operatorname{span}\{ \psi_k \mid k \in \mathbb{N} \}$ cannot
define an infinite-dimensional Lebesgue measure on $\mathcal{H}$, $U - m$ defined by
(11.1) may be said to have a ‘formal Lebesgue density’ proportional to

\begin{align*}
\prod_{k \in \mathbb{N}} \exp\left( -\frac{|\xi_k|^2}{2 \lambda_k} \right)
  &= \exp\left( -\frac{1}{2} \sum_{k \in \mathbb{N}} \frac{|\xi_k|^2}{\lambda_k} \right) \\
  &= \exp\left( -\frac{1}{2} \sum_{k \in \mathbb{N}} \frac{|\langle u - m, \psi_k \rangle_{\mathcal{H}}|^2}{\lambda_k} \right) \\
  &= \exp\left( -\frac{1}{2} \bigl\| C^{-1/2} (u - m) \bigr\|_{\mathcal{H}}^2 \right)
\end{align*}

by Parseval’s theorem and the eigenbasis representation of $C$. This formal
derivation should make it intuitively reasonable that $U$ is a Gaussian random
variable on $\mathcal{H}$ with mean $m$ and covariance operator $C$. For more general
sampling schemes of this type, see the later remarks on the sampling of Besov
measures.

[Figure 11.1; surviving panel titles: K = 100, K = 500]

Fig. 11.1: Approximate sample paths of the Gaussian distribution on
$H_0^1([0, 1])$ that has mean path $m(x) = x(1 - x)$ and covariance operator
$\bigl( -\frac{d^2}{dx^2} \bigr)^{-1}$. Along with the mean path (black), six sample paths (grey) are
shown for truncated Karhunen-Loève expansions using $K \in \mathbb{N}$ terms. Except
for the non-trivial mean, these are approximate draws from the unit
Brownian bridge on $[0, 1]$.
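
The sampling procedure of Example 11.5 can be sketched in Python for the setting of Figure 11.1. The sketch relies on the standard fact that, with zero boundary conditions on $[0, 1]$, the operator $(-d^2/dx^2)^{-1}$ has eigenpairs $\lambda_k = (k\pi)^{-2}$ and $\psi_k(x) = \sqrt{2} \sin(k \pi x)$; the function name `sample_kl_paths`, the grid and the random seed are illustrative choices, and this is not claimed to be the code used to produce the figure.

```python
import numpy as np

def sample_kl_paths(x, K, num_paths, rng):
    """Draw truncated Karhunen-Loeve samples m + sum_{k<=K} sqrt(lam_k) xi_k psi_k."""
    k = np.arange(1, K + 1)
    lam = 1.0 / (k * np.pi) ** 2                           # eigenvalues of (-d^2/dx^2)^{-1}
    psi = np.sqrt(2.0) * np.sin(np.pi * np.outer(k, x))    # psi[k-1, i] = psi_k(x_i)
    xi = rng.standard_normal((num_paths, K))               # i.i.d. N(0, 1) coefficients
    mean_path = x * (1.0 - x)
    return mean_path + (xi * np.sqrt(lam)) @ psi, lam, psi

x = np.linspace(0.0, 1.0, 201)
rng = np.random.default_rng(0)
paths, lam, psi = sample_kl_paths(x, K=500, num_paths=6, rng=rng)

# Pointwise variance of the truncated expansion, sum_k lam_k psi_k(x)^2, which
# approaches the Brownian bridge variance x (1 - x) as K grows.
pointwise_var = lam @ psi**2
```

Each row of `paths` is one approximate sample path on the grid `x`; increasing `K` adds finer-scale oscillations without changing the large-scale shape.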


Principal Component Analysis. As well as being useful for the analysis 
of random paths, surfaces, and so on, Karhunen-Loeve expansions are also 
useful in the analysis of finite-dimensional random vectors and sample data: 

Definition 11.6. A principal component analysis of an $\mathbb{R}^N$-valued random
vector $U$ is the Karhunen-Loève expansion of $U$ seen as a stochastic process
$U \colon \{1, \dots, N\} \times \Theta \to \mathbb{R}$. It is also known as the discrete Karhunen-Loève
transform, the Hotelling transform and the proper orthogonal decomposition.

Principal component analysis is often applied to sample data, and is inti- 
mately related to the singular value decomposition: 

Example 11.7. Let $X \in \mathbb{R}^{N \times M}$ be a matrix whose columns are $M$ independent
and identically distributed samples from some probability measure on
$\mathbb{R}^N$, and assume without loss of generality that the samples have empirical
mean zero. The empirical covariance matrix of the samples is

\[ \widehat{C} := \frac{1}{M} X X^T. \]

(If the samples do not have empirical mean zero, then the empirical mean
should be subtracted first, and then $\frac{1}{M}$ in the definition of $\widehat{C}$ should be
replaced by $\frac{1}{M-1}$ so that $\widehat{C}$ will be an unbiased estimator of the true covariance
matrix $C$.) The eigenvalues $\lambda_n$ and eigenfunctions $\psi_n$ of the Karhunen-Loève
expansion are just the eigenvalues and eigenvectors of this matrix $\widehat{C}$.
Let $\Lambda \in \mathbb{R}^{N \times N}$ be the diagonal matrix of the eigenvalues $\lambda_n$ (which are
non-negative, and are assumed to be in decreasing order) and $\Psi \in \mathbb{R}^{N \times N}$ the
matrix of corresponding orthonormal eigenvectors, so that $\widehat{C}$ diagonalizes as

\[ \widehat{C} = \Psi \Lambda \Psi^T. \]

The principal component transform of the data $X$ is $W := \Psi^T X$; this is
an orthogonal transformation of $\mathbb{R}^N$ that transforms $X$ to a new coordinate
system in which the greatest component-wise variance comes to lie on the
first coordinate (called the first principal component), the second greatest
variance on the second coordinate, and so on.

On the other hand, taking the singular value decomposition of the data
(normalized by the number of samples) yields

\[ \frac{1}{\sqrt{M}} X = U \Sigma V^T, \]

where $U \in \mathbb{R}^{N \times N}$ and $V \in \mathbb{R}^{M \times M}$ are orthogonal and $\Sigma \in \mathbb{R}^{N \times M}$ is diagonal
with decreasing non-negative diagonal entries (the singular values of $\frac{1}{\sqrt{M}} X$).
Then

\[ \widehat{C} = U \Sigma V^T (U \Sigma V^T)^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma^2 U^T, \]

from which we see that $U = \Psi$ and $\Sigma^2 = \Lambda$. This is just another instance
of the well-known relation that, for any matrix $A$, the eigenvalues of $A A^*$
are the squares of the singular values of $A$ and the eigenvectors of $A A^*$ are the left
singular vectors of $A$; however, in this context, it also provides an alternative
way to compute the principal component transform.

In fact, performing principal component analysis via the singular value
decomposition is numerically preferable to forming and then diagonalizing
the covariance matrix, since the formation of $X X^T$ can cause a disastrous
loss of precision; the classic example of this phenomenon is the Läuchli matrix

\[ \begin{bmatrix} 1 & \varepsilon & 0 & 0 \\ 1 & 0 & \varepsilon & 0 \\ 1 & 0 & 0 & \varepsilon \end{bmatrix} \qquad (0 < \varepsilon \ll 1), \]

for which taking the singular value decomposition (e.g. by bidiagonalization
followed by QR iteration) is stable, but forming and diagonalizing $X X^T$ is
unstable.
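
Both observations in Example 11.7 can be illustrated numerically. The following sketch uses synthetic Gaussian data (an assumption made only for the example) to compare the two routes to the principal variances, and then shows how forming $X X^T$ for a Läuchli matrix discards the information about $\varepsilon$ in double precision. The variable names and the value of `eps` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 1000
# Synthetic data: N independent components with different standard deviations.
X = rng.standard_normal((N, M)) * np.array([[3.0], [2.0], [1.0], [0.5]])
X = X - X.mean(axis=1, keepdims=True)        # enforce empirical mean zero

# Route 1: eigenvalues of the empirical covariance matrix (1/M) X X^T.
C_hat = X @ X.T / M
eigvals = np.linalg.eigvalsh(C_hat)[::-1]    # decreasing order

# Route 2: singular values of the normalized data matrix X / sqrt(M).
sing_vals = np.linalg.svd(X / np.sqrt(M), compute_uv=False)

# Lauchli matrix: its exact singular values are sqrt(3 + eps^2), eps and eps,
# but 1 + eps**2 rounds to 1 in double precision when the Gram matrix is formed.
eps = 1e-9
L = np.array([[1.0, eps, 0.0, 0.0],
              [1.0, 0.0, eps, 0.0],
              [1.0, 0.0, 0.0, eps]])
gram = L @ L.T
lauchli_sing_vals = np.linalg.svd(L, compute_uv=False)
```

Here `eigvals` agrees with `sing_vals**2`, while `gram` is exactly the all-ones matrix, so no eigenvalue computation applied to it can recover the two small singular values that the SVD of `L` itself returns.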

Karhunen-Loève Sampling of Non-Gaussian Besov Measures. The
Karhunen-Loève approach to generating samples from Gaussian measures of
known covariance operator, as in Example 11.5, can be extended to more
general settings, in which a basis is prescribed a priori and (not necessarily
Gaussian) random coefficients with a suitable decay rate are used. The choice
of basis elements and the rate of decay of the coefficients together control the
smoothness of the sample realizations; the mathematical hard work lies in
showing that such random series do indeed converge to a well-defined limit,
and thereby define a probability measure on the desired function space.

One method for the construction of function spaces (and hence random
functions) of desired smoothness is to use wavelets. Wavelet bases are
particularly attractive because they allow for the representation of sharply
localized features (e.g. the interface between two media with different material
properties) in a way that globally smooth basis functions such as
polynomials and the Fourier basis do not. Omitting several technicalities, a
wavelet basis of $L^2(\mathbb{R}^d)$ or $L^2(\mathbb{T}^d)$ can be thought of as an orthonormal basis
consisting of appropriately scaled and shifted copies of a single basic element
that has some self-similarity. By controlling the rate of decay of the coefficients
in a wavelet expansion, we obtain a family of function spaces, the
Besov spaces, with three scales of smoothness, here denoted $p$, $q$ and $s$. In
what follows, for any function $f$ on $\mathbb{R}^d$ or $\mathbb{T}^d$, define the scaled and shifted
version $f_{j,k}$ of $f$ for $j, k \in \mathbb{Z}$ by

\[ f_{j,k}(x) := f(2^j x - k). \tag{11.2} \]

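The starting point of a wavelet construction is a scaling function (also
known as the averaging function or father wavelet) $\phi \colon \mathbb{R} \to \mathbb{R}$ and a family
of closed subspaces $V_j \subseteq L^2(\mathbb{R})$, $j \in \mathbb{Z}$, called a multiresolution analysis of
$L^2(\mathbb{R})$, satisfying

(a) (nesting) for all $j \in \mathbb{Z}$, $V_j \subseteq V_{j+1}$;

(b) (density and zero intersection) $\overline{\bigcup_{j \in \mathbb{Z}} V_j} = L^2(\mathbb{R})$ and $\bigcap_{j \in \mathbb{Z}} V_j = \{0\}$;

(c) (scaling) for all $j, k \in \mathbb{Z}$, $f \in V_0 \iff f_{j,k} \in V_j$;

(d) (translates of $\phi$ generate $V_0$) $V_0 = \operatorname{span}\{ \phi_{0,k} \mid k \in \mathbb{Z} \}$;

(e) (Riesz basis) there are finite positive constants $A$ and $B$ such that, for
all sequences $(c_k)_{k \in \mathbb{Z}} \in \ell^2(\mathbb{Z})$,

\[ A \| (c_k) \|_{\ell^2(\mathbb{Z})} \le \Bigl\| \sum_{k \in \mathbb{Z}} c_k \phi_{0,k} \Bigr\|_{L^2(\mathbb{R})} \le B \| (c_k) \|_{\ell^2(\mathbb{Z})}. \]

Given such a scaling function $\phi \colon \mathbb{R} \to \mathbb{R}$, the associated mother wavelet
$\psi \colon \mathbb{R} \to \mathbb{R}$ is defined as follows:

\[ \text{if } \phi(x) = \sum_{k \in \mathbb{Z}} c_k \phi(2x - k), \quad \text{then } \psi(x) = \sum_{k \in \mathbb{Z}} (-1)^k c_{k+1} \phi(2x + k). \]

It is the scaled and shifted copies of the mother wavelet $\psi$ that will form the
desired orthonormal basis of $L^2$.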

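Example 11.8. (a) The indicator function $\phi = \mathbb{I}_{[0,1)}$ satisfies the
self-similarity relation $\phi(x) = \phi(2x) + \phi(2x - 1)$; the associated $\psi$ given by

\[ \psi(x) := \begin{cases} 1, & \text{if } 0 \le x < \tfrac{1}{2}, \\ -1, & \text{if } \tfrac{1}{2} \le x < 1, \\ 0, & \text{otherwise,} \end{cases} \]

is called the Haar wavelet.

(b) The B-spline scaling functions $\sigma_r$, $r \in \mathbb{N}_0$, are piecewise polynomial of
degree $r$ and globally $C^{r-1}$, and are defined recursively by convolution:

\[ \sigma_r := \begin{cases} \mathbb{I}_{[0,1)}, & \text{for } r = 0, \\ \sigma_{r-1} * \sigma_0, & \text{for } r \in \mathbb{N}, \end{cases} \tag{11.3} \]

where

\[ (f * g)(x) := \int_{\mathbb{R}} f(y) g(x - y) \, dy. \]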
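
The Haar case of Example 11.8(a) is simple enough to check numerically. The following Python sketch (function names and grid resolution are illustrative choices) evaluates the scaling function, the wavelet, and the normalized scaled and shifted copies $2^{j/2} \psi(2^j x - k)$ for $k = 0, \dots, 2^j - 1$, and forms their Gram matrix on $[0, 1)$ with the midpoint rule, which is exact for these piecewise-constant functions at the scales used.

```python
import numpy as np

def phi(x):
    """Haar scaling function: the indicator function of [0, 1)."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0.0) & (x < 1.0), 1.0, 0.0)

def psi(x):
    """Haar wavelet, built from the scaling function as phi(2x) - phi(2x - 1)."""
    x = np.asarray(x, dtype=float)
    return phi(2.0 * x) - phi(2.0 * x - 1.0)

def psi_jk(x, j, k):
    """Scaled and shifted wavelet 2^{j/2} psi(2^j x - k), with unit L^2 norm."""
    return 2.0 ** (j / 2.0) * psi(2.0**j * np.asarray(x, dtype=float) - k)

# Midpoints of a dyadic grid on [0, 1), finer than the finest scale used below.
n = 2**6
x = (np.arange(n) + 0.5) / n
pairs = [(j, k) for j in range(4) for k in range(2**j)]
basis = np.array([psi_jk(x, j, k) for j, k in pairs])
gram = basis @ basis.T / n      # L^2([0, 1)) inner products via the midpoint rule
```

The matrix `gram` comes out as the identity, reflecting the orthonormality of the scaled and shifted Haar wavelets.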

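Here, the presentation focusses on Besov spaces of 1-periodic functions,
i.e. functions on the unit circle $\mathbb{T} := \mathbb{R} / \mathbb{Z}$, and on the $d$-dimensional unit
torus $\mathbb{T}^d := \mathbb{R}^d / \mathbb{Z}^d$. To this end, periodize the scaling function and mother
wavelet: set

\[ \phi(x) := \sum_{s \in \mathbb{Z}} \phi(x + s) \quad \text{and} \quad \psi(x) := \sum_{s \in \mathbb{Z}} \psi(x + s), \]

where the functions on the right-hand sides are the scaling function and
mother wavelet on $\mathbb{R}$. Scaled and translated versions of these functions are
defined as usual by (11.2). Note that in the toroidal case the spaces $V_j$ for
$j < 0$ consist of constant functions, and that, for each scale $j \in \mathbb{N}_0$, $\phi \in V_0$ has
only $2^j$ distinct scaled translates $\phi_{j,k} \in V_j$, i.e. those with $k = 0, \dots, 2^j - 1$. Let

\[ V_j := \operatorname{span}\{ \phi_{j,k} \mid k = 0, \dots, 2^j - 1 \}, \]
\[ W_j := \operatorname{span}\{ \psi_{j,k} \mid k = 0, \dots, 2^j - 1 \}, \]

so that $W_j$ is the orthogonal complement of $V_j$ in $V_{j+1}$ and

\[ L^2(\mathbb{T}) = \overline{\bigcup_{j \in \mathbb{N}_0} V_j} = \overline{\bigoplus_{j \in \mathbb{N}_0} W_j}. \]

Indeed, if $\psi$ has unit norm, then $2^{j/2} \psi_{j,k}$ also has unit norm, and
$\{ 2^{j/2} \psi_{j,k} \mid k = 0, \dots, 2^j - 1 \}$ is an orthonormal basis of $W_j$, and
$\{ 2^{j/2} \psi_{j,k} \mid j \in \mathbb{N}_0, k = 0, \dots, 2^j - 1 \}$ is an orthonormal basis of $L^2(\mathbb{T})$,
a so-called wavelet basis.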

To construct an analogous wavelet basis of $L^2(\mathbb{T}^d)$ for $d \ge 1$, proceed as
follows: for $\nu \in \{0, 1\}^d \setminus \{(0, \dots, 0)\}$, $j \in \mathbb{N}_0$, and $k \in \{0, \dots, 2^j - 1\}^d$, define
the scaled and translated wavelet $\psi^\nu_{j,k} \colon \mathbb{T}^d \to \mathbb{R}$ by

\[ \psi^\nu_{j,k}(x) := 2^{dj/2} \psi^{\nu_1}(2^j x_1 - k_1) \cdots \psi^{\nu_d}(2^j x_d - k_d), \]

where $\psi^0 = \phi$ and $\psi^1 = \psi$. The system

\[ \bigl\{ \psi^\nu_{j,k} \bigm| j \in \mathbb{N}_0,\ k \in \{0, \dots, 2^j - 1\}^d,\ \nu \in \{0, 1\}^d \setminus \{(0, \dots, 0)\} \bigr\} \]

is an orthonormal wavelet basis of $L^2(\mathbb{T}^d)$.

The Besov space $B^s_{pq}(\mathbb{T}^d)$ can be characterized in terms of the summability
of wavelet coefficients at the various scales:

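Definition 11.9. Let $1 \le p, q < \infty$ and let $s > 0$. The Besov $(p, q, s)$ norm
of a function $u = \sum_{j,k,\nu} u^\nu_{j,k} \psi^\nu_{j,k} \colon \mathbb{T}^d \to \mathbb{R}$ is defined by

\begin{align*}
\Bigl\| \sum_{j \in \mathbb{N}_0} \sum_{\nu, k} u^\nu_{j,k} \psi^\nu_{j,k} \Bigr\|_{B^s_{pq}(\mathbb{T}^d)}
  &:= \Bigl\| j \mapsto 2^{js} 2^{jd(\frac{1}{2} - \frac{1}{p})} \bigl\| (k, \nu) \mapsto u^\nu_{j,k} \bigr\|_{\ell^p} \Bigr\|_{\ell^q(\mathbb{N}_0)} \\
  &:= \Biggl( \sum_{j \in \mathbb{N}_0} 2^{qjs} 2^{qjd(\frac{1}{2} - \frac{1}{p})} \Bigl( \sum_{\nu, k} |u^\nu_{j,k}|^p \Bigr)^{q/p} \Biggr)^{1/q},
\end{align*}

and the Besov space $B^s_{pq}(\mathbb{T}^d)$ is the completion of the space of functions for
which this norm is finite.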

Note that at each scale $j$, there are $(2^d - 1) 2^{jd} = 2^{(j+1)d} - 2^{jd}$ wavelet
coefficients. The indices $j$, $k$ and $\nu$ can be combined into a single index $\ell \in \mathbb{N}$.
First, $\ell = 1$ corresponds to the scaling function $\phi(x_1) \cdots \phi(x_d)$. The remaining
numbering is done scale by scale; that is, we first number wavelets with $j = 0$,
then wavelets with $j = 1$, and so on. Within each scale $j \in \mathbb{N}_0$, the $2^d - 1$
indices $\nu$ are ordered by thinking of them as binary representations of integers,
and an ordering of the $2^{jd}$ translations $k$ can be chosen arbitrarily. With this
renumbering,

\[ \sum_{\ell=1}^{\infty} u_\ell \psi_\ell \in B^s_{pq}(\mathbb{T}^d) \iff \Biggl( 2^{js} 2^{jd(\frac{1}{2} - \frac{1}{p})} \Bigl( \sum_{\ell = 2^{jd}}^{2^{(j+1)d} - 1} |u_\ell|^p \Bigr)^{1/p} \Biggr)_{j \in \mathbb{N}_0} \in \ell^q(\mathbb{N}_0). \]

For $p = q$, since at scale $j$ it holds that $2^{jd} \le \ell < 2^{(j+1)d}$, an equivalent norm
for $B^s_{pp}(\mathbb{T}^d)$ is

\[ \Bigl\| \sum_{\ell \in \mathbb{N}} u_\ell \psi_\ell \Bigr\|_{B^s_{pp}(\mathbb{T}^d)} \simeq \Bigl\| \sum_{\ell \in \mathbb{N}} u_\ell \psi_\ell \Bigr\|_{X^{s,p}} := \Biggl( \sum_{\ell=1}^{\infty} \ell^{(ps/d + p/2 - 1)} |u_\ell|^p \Biggr)^{1/p}; \]

in particular if the original scaling function and mother wavelet are $r$ times
differentiable with $r > s$, then $B^s_{22}$ coincides with the Sobolev space $H^s$.
This leads to a Karhunen-Loève-type sampling procedure for $B^s_{pp}(\mathbb{T}^d)$, as in
Example 11.5: $U$ defined by

\[ U := \sum_{\ell \in \mathbb{N}} \ell^{-(\frac{s}{d} + \frac{1}{2} - \frac{1}{p})} \kappa^{-\frac{1}{p}} \xi_\ell \psi_\ell, \tag{11.4} \]

where $\xi_\ell$ are sampled independently and identically from the generalized
Gaussian measure on $\mathbb{R}$ with Lebesgue density proportional to $\exp(-\frac{1}{2} |\xi_\ell|^p)$,
can be said to have ‘formal Lebesgue density’ proportional to $\exp(-\frac{\kappa}{2} \|u\|^p_{B^s_{pp}})$,
and is therefore a natural candidate for a ‘typical’ element of the Besov space
$B^s_{pp}(\mathbb{T}^d)$. More generally, given any orthonormal basis $\{ \psi_k \mid k \in \mathbb{N} \}$ of some
Hilbert space, one can define a Banach subspace $X^{s,p}$ with norm

\[ \Bigl\| \sum_{\ell \in \mathbb{N}} u_\ell \psi_\ell \Bigr\|_{X^{s,p}} := \Biggl( \sum_{\ell=1}^{\infty} \ell^{(ps/d + p/2 - 1)} |u_\ell|^p \Biggr)^{1/p} \]

and define a Besov-distributed random variable $U$ by (11.4).
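
The coefficient sequence in (11.4) is straightforward to sample. The sketch below assumes the reading of (11.4) in which the $\ell$-th coefficient is $\ell^{-(s/d + 1/2 - 1/p)} \kappa^{-1/p} \xi_\ell$, and draws the generalized Gaussian variables through a Gamma transformation (if $G \sim \mathrm{Gamma}(1/p, \text{scale } 2)$, then $G^{1/p}$ with a random sign has density proportional to $\exp(-\frac{1}{2}|\xi|^p)$). All function names and parameter values are illustrative, and no wavelet basis is constructed: only the coefficients and their truncated $X^{t,p}$ norm are computed.

```python
import numpy as np

def generalized_gaussian(p, size, rng):
    """Sample from the density proportional to exp(-|xi|^p / 2) on the real line."""
    # If G ~ Gamma(shape=1/p, scale=2), then G^(1/p) has density proportional
    # to exp(-x^p / 2) on (0, inf); a random sign symmetrizes it.
    magnitude = rng.gamma(shape=1.0 / p, scale=2.0, size=size) ** (1.0 / p)
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * magnitude

def besov_coefficients(s, p, d, kappa, num_terms, rng):
    """Coefficients of the first num_terms terms of the random series (11.4)."""
    ell = np.arange(1, num_terms + 1)
    xi = generalized_gaussian(p, num_terms, rng)
    u = ell ** (-(s / d + 0.5 - 1.0 / p)) * kappa ** (-1.0 / p) * xi
    return u, xi

def x_norm(u, t, p, d):
    """Truncated X^{t,p} norm of the coefficient sequence u."""
    ell = np.arange(1, u.size + 1)
    return np.sum(ell ** (p * t / d + p / 2.0 - 1.0) * np.abs(u) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
s, p, d, kappa = 2.0, 1.0, 1, 1.0
u, xi = besov_coefficients(s, p, d, kappa, num_terms=2**12, rng=rng)
```

With these definitions, `kappa * x_norm(u, s, p, d)**p` equals `np.sum(np.abs(xi)**p)`, which is the identity behind the ‘formal Lebesgue density’; evaluating `x_norm(u, t, p, d)` for several values of `t` and increasing `num_terms` gives a rough empirical view of which norms stay bounded.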

It remains, however, to check that (11.4) not only defines a measure, but
that it assigns unit probability mass to the Besov space from which it is
desired to draw samples. It turns out that the question of whether or not
$U \in X^{s,p}$ with probability one is closely related to having a Fernique theorem
(q.v. Theorem 2.47) for Besov measures:

Theorem 11.10. Let $U$ be defined as in (11.4), with $1 \le p < \infty$ and $s > 0$.
Then

\begin{align*}
\|U\|_{X^{t,p}} < \infty \text{ almost surely}
  &\iff \mathbb{E}\bigl[ \exp( \alpha \|U\|^p_{X^{t,p}} ) \bigr] < \infty \text{ for all } \alpha \in (0, \tfrac{\kappa}{2}) \\
  &\iff t < s - \frac{d}{p}.
\end{align*}

Furthermore, for $p \ge 1$, $s > \frac{d}{p}$, and $t < s - \frac{d}{p}$, there is a constant $r^*$ depending
only on $p$, $d$, $s$ and $t$ such that, for all $\alpha \in (0, \frac{\kappa}{2 r^*})$,
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11.2 Wiener— Hermite Polynomial Chaos 

The next section will cover polynomial chaos (PC) expansions in greater
generality, and this section serves as an introductory prelude. In this, the classical
and notationally simplest setting, we consider expansions of a real-valued random
variable $U$ with respect to a single standard Gaussian random variable
$\Xi$, using appropriate orthogonal polynomials of $\Xi$, i.e. the Hermite polynomials.
This setting was pioneered by Norbert Wiener, and so it is known
as the Wiener-Hermite polynomial chaos. The term ‘chaos’ is perhaps a bit
confusing, and is not related to the use of the term in the study of dynamical
systems; its original meaning, as used by Wiener (1938), was something
closer to what would nowadays be called a stochastic process:

“Of all the forms of chaos occurring in physics, there is only one class which has 
been studied with anything approaching completeness. This is the class of types of 
chaos connected with the theory of Brownian motion.” 

Let $\Xi \sim \gamma = \mathcal{N}(0, 1)$ be a standard Gaussian random variable, and let
$\mathrm{He}_n \in \mathfrak{P}$, for $n \in \mathbb{N}_0$, be the Hermite polynomials, the orthogonal polynomials
for the standard Gaussian measure $\gamma$ with the normalization

\[
\int_{\mathbb{R}} \mathrm{He}_m(\xi) \mathrm{He}_n(\xi) \, \mathrm{d}\gamma(\xi) = n! \, \delta_{mn}.
\]

By the Weierstrass approximation theorem (Theorem 8.20) and the
approximability of $L^2$ functions by continuous ones, the Hermite polynomials form
a complete orthogonal basis of the Hilbert space $L^2(\mathbb{R}, \gamma; \mathbb{R})$ with the inner
product

\[
\langle U, V \rangle_{L^2(\gamma)} := \mathbb{E}[U(\Xi) V(\Xi)] = \int_{\mathbb{R}} U(\xi) V(\xi) \, \mathrm{d}\gamma(\xi).
\]

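Definition 11.11. Let $U \in L^2(\mathbb{R}, \gamma; \mathbb{R})$ be a square-integrable real-valued
random variable. The Wiener-Hermite polynomial chaos expansion of $U$ with
respect to the standard Gaussian $\Xi$ is the expansion of $U$ in the orthogonal
basis $\{\mathrm{He}_n\}_{n \in \mathbb{N}_0}$, i.e.

\[
U = \sum_{n \in \mathbb{N}_0} u_n \mathrm{He}_n(\Xi)
\]

with scalar Wiener-Hermite polynomial chaos coefficients $\{u_n\}_{n \in \mathbb{N}_0} \subseteq \mathbb{R}$
given by

\[
u_n = \frac{\langle U, \mathrm{He}_n \rangle_{L^2(\gamma)}}{\|\mathrm{He}_n\|^2_{L^2(\gamma)}}
= \frac{1}{n! \sqrt{2\pi}} \int_{-\infty}^{\infty} U(\xi) \mathrm{He}_n(\xi) e^{-\xi^2/2} \, \mathrm{d}\xi.
\]

Note that, in particular, since $\mathrm{He}_0 \equiv 1$,

\[
\mathbb{E}[U] = \langle \mathrm{He}_0, U \rangle_{L^2(\gamma)}
= \sum_{n \in \mathbb{N}_0} u_n \langle \mathrm{He}_0, \mathrm{He}_n \rangle_{L^2(\gamma)}
= u_0,
\]

so the expected value of $U$ is simply its 0th PC coefficient. Similarly, its
variance is a weighted sum of the squares of its PC coefficients:

\begin{align*}
\mathbb{V}[U] &= \mathbb{E}\bigl[ |U - \mathbb{E}[U]|^2 \bigr] \\
&= \mathbb{E}\Bigl[ \Bigl| \sum_{n \in \mathbb{N}} u_n \mathrm{He}_n \Bigr|^2 \Bigr] && \text{since } \mathbb{E}[U] = u_0 \\
&= \sum_{m, n \in \mathbb{N}} u_m u_n \langle \mathrm{He}_m, \mathrm{He}_n \rangle_{L^2(\gamma)} \\
&= \sum_{n \in \mathbb{N}} u_n^2 \|\mathrm{He}_n\|^2_{L^2(\gamma)} && \text{by Hermitian orthogonality} \\
&= \sum_{n \in \mathbb{N}} u_n^2 \, n!.
\end{align*}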
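As a rough numerical illustration of Definition 11.11, the coefficients $u_n$ can be approximated by Gauss-Hermite quadrature. The sketch below assumes NumPy's `numpy.polynomial.hermite_e` module (which implements the probabilists' polynomials $\mathrm{He}_n$); the function names and the choice of 64 quadrature nodes are illustrative assumptions.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as H

def wiener_hermite_coefficients(f, K, n_quad=64):
    """Approximate u_0, ..., u_K for U = f(Xi), Xi ~ N(0, 1), by quadrature."""
    x, w = H.hermegauss(n_quad)           # nodes/weights for the weight exp(-x^2 / 2)
    w = w / math.sqrt(2.0 * math.pi)      # normalise to the Gaussian measure gamma
    fx = f(x)
    coeffs = []
    for n in range(K + 1):
        he_n = H.hermeval(x, [0.0] * n + [1.0])   # He_n at the quadrature nodes
        coeffs.append(np.sum(w * fx * he_n) / math.factorial(n))
    return np.array(coeffs)

def mean_and_variance(u):
    """E[U] = u_0 and V[U] = sum_{n >= 1} u_n^2 n! for a (truncated) expansion."""
    n_fact = np.array([math.factorial(n) for n in range(len(u))], dtype=float)
    return u[0], np.sum(u[1:] ** 2 * n_fact[1:])

# U = Xi^3 = He_3(Xi) + 3 He_1(Xi)
u = wiener_hermite_coefficients(lambda x: x ** 3, K=3)
mean, var = mean_and_variance(u)
```

For polynomial `f` the quadrature is exact up to rounding; for non-smooth `f` the number of nodes would need to be chosen with more care.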


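Example 11.12. Let $X \sim \mathcal{N}(m, \sigma^2)$ be a real-valued Gaussian random
variable with mean $m \in \mathbb{R}$ and variance $\sigma^2 > 0$. Let $Y := e^X$; since $\log Y$ is
normally distributed, the non-negative-valued random variable $Y$ is said to
be a log-normal random variable. As usual, let $\Xi \sim \mathcal{N}(0, 1)$ be the standard
Gaussian random variable; clearly $X$ has the same distribution as $m + \sigma\Xi$,
and $Y$ has the same distribution as $e^m e^{\sigma\Xi}$. The Wiener-Hermite expansion
of $Y$ as $\sum_{k \in \mathbb{N}_0} y_k \mathrm{He}_k(\Xi)$ has coefficients

\begin{align*}
y_k &= \frac{\langle e^{m + \sigma\Xi}, \mathrm{He}_k(\Xi) \rangle}{\|\mathrm{He}_k(\Xi)\|^2} \\
&= \frac{e^m}{k!} \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{\sigma\xi} \mathrm{He}_k(\xi) e^{-\xi^2/2} \, \mathrm{d}\xi \\
&= \frac{e^{m + \sigma^2/2}}{k!} \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \mathrm{He}_k(\xi) e^{-(\xi - \sigma)^2/2} \, \mathrm{d}\xi \\
&= \frac{e^{m + \sigma^2/2}}{k!} \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \mathrm{He}_k(w + \sigma) e^{-w^2/2} \, \mathrm{d}w.
\end{align*}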


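This Gaussian integral can be evaluated directly using the Cameron-Martin
formula (Lemma 2.40), or else using the formula

\[
\mathrm{He}_n(x + y) = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} \mathrm{He}_k(y),
\]

which follows from the derivative property $\mathrm{He}_n' = n \mathrm{He}_{n-1}$, with $x = \sigma$ and
$y = w$: this formula yields that

\[
y_k = \frac{e^{m + \sigma^2/2}}{k!} \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \sum_{j=0}^{k} \binom{k}{j} \sigma^{k-j} \mathrm{He}_j(w) e^{-w^2/2} \, \mathrm{d}w
= \frac{e^{m + \sigma^2/2} \sigma^k}{k!},
\]

since the orthogonality relation $\langle \mathrm{He}_m, \mathrm{He}_n \rangle_{L^2(\gamma)} = n! \, \delta_{mn}$ with $n = 0$ implies
that every Hermite polynomial other than $\mathrm{He}_0$ has mean 0 under standard
Gaussian measure. That is,

\[
Y = e^{m + \sigma^2/2} \sum_{k \in \mathbb{N}_0} \frac{\sigma^k}{k!} \mathrm{He}_k(\Xi).
\tag{11.5}
\]

The Wiener-Hermite expansion (11.5) reveals that $\mathbb{E}[Y] = e^{m + \sigma^2/2}$ and

\[
\mathbb{V}[Y] = e^{2m + \sigma^2} \sum_{k \in \mathbb{N}} \left( \frac{\sigma^k}{k!} \right)^2 \|\mathrm{He}_k\|^2_{L^2(\gamma)}
= e^{2m + \sigma^2} \bigl( e^{\sigma^2} - 1 \bigr).
\]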
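The closed-form coefficients in (11.5) are easy to check numerically. The following sketch (function names and the parameter values $m = 0.3$, $\sigma = 0.5$ are illustrative assumptions) compares the variance obtained by summing $y_k^2 \, k!$ over a truncated range of $k$ with the exact log-normal variance $e^{2m + \sigma^2}(e^{\sigma^2} - 1)$.

```python
import math

def lognormal_pc_coefficients(m, sigma, K):
    """Closed-form Wiener-Hermite coefficients y_0, ..., y_K from (11.5)."""
    pref = math.exp(m + sigma ** 2 / 2)
    return [pref * sigma ** k / math.factorial(k) for k in range(K + 1)]

def variance_from_coefficients(y):
    """Sum of y_k^2 k! over k >= 1 (the k = 0 term is the squared mean)."""
    return sum(y_k ** 2 * math.factorial(k) for k, y_k in enumerate(y) if k >= 1)

m, sigma = 0.3, 0.5
y = lognormal_pc_coefficients(m, sigma, K=20)
pc_mean = y[0]
pc_var = variance_from_coefficients(y)
exact_var = math.exp(2 * m + sigma ** 2) * (math.exp(sigma ** 2) - 1)
```

With these parameters the truncated sum should agree with `exact_var` to roughly machine precision, because the terms $\sigma^{2k}/k!$ decay factorially.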



Truncation of Wiener-Hermite Expansions. Of course, in practice, the
series expansion $U = \sum_{k \in \mathbb{N}_0} u_k \mathrm{He}_k(\Xi)$ must be truncated after finitely many
terms, and so it is natural to ask about the quality of the approximation

\[
U \approx U^K := \sum_{k=0}^{K} u_k \mathrm{He}_k(\Xi).
\]

Since the Hermite polynomials $\{\mathrm{He}_k\}_{k \in \mathbb{N}_0}$ form a complete orthogonal basis
for $L^2(\mathbb{R}, \gamma; \mathbb{R})$, the standard results about orthogonal approximations in
Hilbert spaces apply. In particular, by Corollary 3.26, the truncation error
$U - U^K$ is orthogonal to the space from which $U^K$ was chosen, i.e.

\[
\operatorname{span}\{\mathrm{He}_0, \mathrm{He}_1, \dots, \mathrm{He}_K\},
\]

and tends to zero in mean square; in the stochastic context, this observation
was first made by Cameron and Martin (1947, Section 2).

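Lemma 11.13. The truncation error $U - U^K$ is orthogonal to the subspace

\[
\operatorname{span}\{\mathrm{He}_0, \mathrm{He}_1, \dots, \mathrm{He}_K\}
\]

of $L^2(\mathbb{R}, \gamma; \mathbb{R})$. Furthermore, $\lim_{K \to \infty} U^K = U$ in $L^2(\mathbb{R}, \gamma; \mathbb{R})$.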


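Proof. Let $V := \sum_{m=0}^{K} v_m \mathrm{He}_m$ be any element of the subspace of $L^2(\mathbb{R}, \gamma; \mathbb{R})$
spanned by the Hermite polynomials of degree at most $K$. Then

\begin{align*}
\langle U - U^K, V \rangle_{L^2(\gamma)}
&= \Bigl\langle \sum_{n > K} u_n \mathrm{He}_n, \sum_{m=0}^{K} v_m \mathrm{He}_m \Bigr\rangle_{L^2(\gamma)} \\
&= \sum_{\substack{n > K \\ m \in \{0, \dots, K\}}} u_n v_m \langle \mathrm{He}_n, \mathrm{He}_m \rangle_{L^2(\gamma)} \\
&= 0.
\end{align*}

Hence, by Pythagoras’ theorem,

\[
\|U\|^2_{L^2(\gamma)} = \|U^K\|^2_{L^2(\gamma)} + \|U - U^K\|^2_{L^2(\gamma)}.
\]

Since the Hermite polynomials form a complete orthogonal basis,
$\|U^K\|^2_{L^2(\gamma)} = \sum_{k=0}^{K} u_k^2 \, k!$ increases to $\|U\|^2_{L^2(\gamma)}$ as $K \to \infty$,
and hence $\|U - U^K\|_{L^2(\gamma)} \to 0$ as $K \to \infty$.

□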
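The Pythagorean identity in the proof gives a directly computable expression for the mean-square truncation error, $\|U - U^K\|^2_{L^2(\gamma)} = \sum_{k > K} u_k^2 \, k!$. As a hedged illustration, the sketch below evaluates this tail sum for the log-normal random variable of Example 11.12, whose coefficients are known in closed form from (11.5); the function name and the cut-off `k_max` (used to approximate the infinite tail) are assumptions.

```python
import math

def truncation_error_sq(sigma, K, m=0.0, k_max=60):
    """||Y - Y^K||^2 for the log-normal Y of (11.5): sum over k > K of y_k^2 k!."""
    pref = math.exp(2 * m + sigma ** 2)
    # y_k^2 k! = e^{2m + sigma^2} sigma^{2k} / k!
    return pref * sum(sigma ** (2 * k) / math.factorial(k)
                      for k in range(K + 1, k_max + 1))

errors = [truncation_error_sq(1.0, K) for K in range(6)]
```

The list `errors` is monotonically decreasing in $K$, and its first entry (for $K = 0$) is the variance of $Y$, consistent with Lemma 11.13.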


11.3 Generalized Polynomial Chaos Expansions 

The ideas of polynomial chaos can be generalized well beyond the setting
in which the elementary random variable $\Xi$ used to generate the orthogonal
decomposition is a standard Gaussian random variable, or even a vector
$\Xi = (\Xi_1, \dots, \Xi_d)$ of mutually orthogonal Gaussian random variables. Such
expansions are referred to as generalized polynomial chaos (gPC) expansions.

Let $\Xi = (\Xi_1, \dots, \Xi_d)$ be an $\mathbb{R}^d$-valued random variable with independent
(and hence $L^2$-orthogonal) components, called the stochastic germ. Let the
measurable rectangle $\Theta = \Theta_1 \times \dots \times \Theta_d \subseteq \mathbb{R}^d$ be the support (i.e. range) of
$\Xi$. Denote by $\mu = \mu_1 \otimes \dots \otimes \mu_d$ the distribution of $\Xi$ on $\Theta$. The objective is to
express any function (random variable, random vector, or even random field)
$U \in L^2(\Theta, \mu)$ in terms of elementary $\mu$-orthogonal functions of the stochastic
germ $\Xi$.

As usual, let $\mathfrak{P}^d$ denote the ring of all $d$-variate polynomials with real
coefficients, and let $\mathfrak{P}^d_{\leq p}$ denote those polynomials of total degree at most
$p \in \mathbb{N}_0$. Let $\Gamma_p \subseteq \mathfrak{P}^d_{\leq p}$ be a collection of polynomials that are mutually
orthogonal, orthogonal to $\mathfrak{P}^d_{\leq p-1}$, and, together with $\Gamma_0, \dots, \Gamma_{p-1}$, span $\mathfrak{P}^d_{\leq p}$.
Assuming for convenience, as usual, the completeness of the resulting system of
orthogonal polynomials, this yields the orthogonal decomposition

\[
L^2(\Theta, \mu; \mathbb{R}) = \bigoplus_{p \in \mathbb{N}_0} \operatorname{span} \Gamma_p.
\]

It is important to note that there is a lack of uniqueness in these basis
polynomials whenever $d \geq 2$: each choice of ordering of multi-indices $\alpha \in \mathbb{N}_0^d$
can yield a different orthogonal basis of $L^2(\Theta, \mu)$ when the Gram-Schmidt
procedure is applied to the monomials $\xi^\alpha$.

Note that (as usual, assuming separability) the $L^2$ space over the product
probability space $(\Theta, \mu)$ is isomorphic to the Hilbert space tensor product
of the $L^2$ spaces over the marginal probability spaces:

\[
L^2(\Theta_1 \times \dots \times \Theta_d, \mu_1 \otimes \dots \otimes \mu_d; \mathbb{R}) = \bigotimes_{i=1}^{d} L^2(\Theta_i, \mu_i; \mathbb{R});
\]

hence, as in Theorem 8.25, an orthogonal system of multivariate polynomials
for $L^2(\Theta, \mu; \mathbb{R})$ can be found by taking products of univariate orthogonal
polynomials for the marginal spaces $L^2(\Theta_i, \mu_i; \mathbb{R})$. A generalized polynomial
chaos (gPC) expansion of a random variable or stochastic process $U$ is simply
the expansion of $U$ with respect to such a complete orthogonal polynomial
basis of $L^2(\Theta, \mu)$.

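Example 11.14. Let $\Xi = (\Xi_1, \Xi_2)$ be such that $\Xi_1$ and $\Xi_2$ are independent
(and hence orthogonal) and such that $\Xi_1$ is a standard Gaussian random
variable and $\Xi_2$ is uniformly distributed on $[-1, 1]$. Hence, the univariate
orthogonal polynomials for $\Xi_1$ are the Hermite polynomials $\mathrm{He}_n$ and the
univariate orthogonal polynomials for $\Xi_2$ are the Legendre polynomials $\mathrm{Le}_n$.
Thus, by Theorem 8.25, a system of orthogonal polynomials for $\Xi$ up to total
degree 3 is

\begin{align*}
\Gamma_0 &= \{1\}, \\
\Gamma_1 &= \{\mathrm{He}_1(\xi_1), \mathrm{Le}_1(\xi_2)\} \\
&= \{\xi_1, \xi_2\}, \\
\Gamma_2 &= \{\mathrm{He}_2(\xi_1), \mathrm{He}_1(\xi_1)\mathrm{Le}_1(\xi_2), \mathrm{Le}_2(\xi_2)\} \\
&= \{\xi_1^2 - 1, \; \xi_1 \xi_2, \; \tfrac{1}{2}(3\xi_2^2 - 1)\}, \\
\Gamma_3 &= \{\mathrm{He}_3(\xi_1), \mathrm{He}_2(\xi_1)\mathrm{Le}_1(\xi_2), \mathrm{He}_1(\xi_1)\mathrm{Le}_2(\xi_2), \mathrm{Le}_3(\xi_2)\} \\
&= \{\xi_1^3 - 3\xi_1, \; \xi_1^2 \xi_2 - \xi_2, \; \tfrac{1}{2}(3\xi_1 \xi_2^2 - \xi_1), \; \tfrac{1}{2}(5\xi_2^3 - 3\xi_2)\}.
\end{align*}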
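The orthogonality of this mixed Hermite-Legendre system can be checked numerically with a tensor-product Gauss quadrature rule. The sketch below is an illustration under stated assumptions: it uses NumPy's `hermite_e` and `legendre` modules, ten nodes per variable (more than enough for total degree 3), and the helper names `he`, `le` are hypothetical.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as H, legendre as L

def he(n, x):
    """Probabilists' Hermite polynomial He_n evaluated at x."""
    return H.hermeval(x, [0.0] * n + [1.0])

def le(n, x):
    """Legendre polynomial Le_n evaluated at x."""
    return L.legval(x, [0.0] * n + [1.0])

# Tensor quadrature for mu = N(0, 1) (x) Uniform[-1, 1]
x1, w1 = H.hermegauss(10)
w1 = w1 / math.sqrt(2.0 * math.pi)      # Gaussian probability weights
x2, w2 = L.leggauss(10)
w2 = w2 / 2.0                           # uniform probability weights on [-1, 1]
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
W = np.outer(w1, w2)

# Multi-indices (a1, a2) with a1 + a2 <= 3, in the order of Gamma_0, ..., Gamma_3
indices = [(a1, p - a1) for p in range(4) for a1 in range(p, -1, -1)]
basis = [he(a1, X1) * le(a2, X2) for a1, a2 in indices]
gram = np.array([[np.sum(W * bi * bj) for bj in basis] for bi in basis])
expected_diag = [math.factorial(a1) / (2 * a2 + 1) for a1, a2 in indices]
```

The Gram matrix `gram` should be diagonal, with entries $\langle \Psi_\alpha^2 \rangle = \alpha_1! / (2\alpha_2 + 1)$, the product of the univariate squared norms.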

Remark 11.15. To simplify the notation in what follows, the following
conventions will be observed:

(a) To simplify expectations, inner products and norms, $\langle \cdot \rangle_\mu$ or simply $\langle \cdot \rangle$
will denote integration (i.e. expectation) with respect to the probability
measure $\mu$, so that the $L^2(\mu)$ inner product is simply $\langle X, Y \rangle_{L^2(\mu)} = \langle XY \rangle_\mu$.

(b) Rather than have the orthogonal basis polynomials be indexed by
multi-indices $\alpha \in \mathbb{N}_0^d$, or have two scalar indices, one for the degree $p$ and one
within each set $\Gamma_p$, it is convenient to order the basis polynomials using
a single scalar index $k \in \mathbb{N}_0$. It is common in practice to take $\Psi_0 = 1$ and
to have the polynomial degree be (weakly) increasing with respect to the
new index $k$. So, to continue Example 11.14, one could use the graded
lexicographic ordering on $\alpha \in \mathbb{N}_0^2$ so that $\Psi_0(\xi) = 1$ and

\begin{align*}
\Psi_1(\xi) &= \xi_1, & \Psi_2(\xi) &= \xi_2, & \Psi_3(\xi) &= \xi_1^2 - 1, \\
\Psi_4(\xi) &= \xi_1 \xi_2, & \Psi_5(\xi) &= \tfrac{1}{2}(3\xi_2^2 - 1), & \Psi_6(\xi) &= \xi_1^3 - 3\xi_1, \\
\Psi_7(\xi) &= \xi_1^2 \xi_2 - \xi_2, & \Psi_8(\xi) &= \tfrac{1}{2}(3\xi_1 \xi_2^2 - \xi_1), & \Psi_9(\xi) &= \tfrac{1}{2}(5\xi_2^3 - 3\xi_2).
\end{align*}

(c) By abuse of notation, $\Psi_k$ will stand for both a polynomial function (which
is a deterministic function from $\mathbb{R}^d$ to $\mathbb{R}$) and for the real-valued random
variable that is the composition of that polynomial with the stochastic
germ $\Xi$ (which is a function from an abstract probability space to $\mathbb{R}$).

Truncation of gPC Expansions. Suppose that a gPC expansion of the
form $U = \sum_{k \in \mathbb{N}_0} u_k \Psi_k$ is truncated, i.e. we consider

\[
U^K = \sum_{k=0}^{K} u_k \Psi_k.
\]

It is an easy exercise to show that the truncation error $U - U^K$ is orthogonal
to $\operatorname{span}\{\Psi_0, \dots, \Psi_K\}$. It is also worth considering how many terms there are
in such a truncated gPC expansion. Suppose that the stochastic germ $\Xi$ has
dimension $d$ (i.e. has $d$ independent components), and we work only with
polynomials of total degree at most $p$. The total number of coefficients in the
truncated expansion $U^K$ is

\[
K + 1 = \frac{(d + p)!}{d! \, p!}.
\]

That is, the total number of gPC coefficients that must be calculated grows
combinatorially as a function of the number of input random variables and the
degree of polynomial approximation. Such rapid growth limits the usefulness
of gPC expansions for practical applications where d and p are much greater
than the order of 10 or so.
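To make the single-index ordering and the coefficient count concrete, the following sketch enumerates multi-indices in a graded lexicographic order and compares the count with $(d + p)!/(d! \, p!)$. The function name is hypothetical, and the brute-force enumeration over $\{0, \dots, p\}^d$ is chosen for clarity rather than efficiency.

```python
import itertools
import math

def graded_multi_indices(d, p):
    """All alpha in N_0^d with |alpha| <= p, ordered by total degree and then
    lexicographically with the first variable taking precedence."""
    alphas = [a for a in itertools.product(range(p + 1), repeat=d) if sum(a) <= p]
    return sorted(alphas, key=lambda a: (sum(a), tuple(-ai for ai in a)))

indices = graded_multi_indices(2, 3)      # matches Psi_0, ..., Psi_9 in Remark 11.15
n_terms = len(graded_multi_indices(3, 4))
n_large = math.comb(10 + 10, 10)          # number of terms for d = p = 10
```

For $d = p = 10$ the count is already 184756, which illustrates the combinatorial growth described above.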

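Expansions of Random Variables. Consider a real-valued random variable
$U$, which we expand in terms of a stochastic germ $\Xi$ as

\[
U(\Xi) = \sum_{k \in \mathbb{N}_0} u_k \Psi_k(\Xi),
\]

where the basis functions $\Psi_k$ are orthogonal with respect to the law of $\Xi$,
and with the usual convention that $\Psi_0 = 1$. A first, easy, observation is that

\[
\mathbb{E}[U] = \langle \Psi_0 U \rangle = \sum_{k \in \mathbb{N}_0} u_k \langle \Psi_0 \Psi_k \rangle = u_0,
\]

so the expected value of $U$ is simply its 0th gPC coefficient. Similarly, its
variance is a weighted sum of the squares of its gPC coefficients:

\begin{align*}
\mathbb{V}[U] &= \mathbb{E}\bigl[ |U - \mathbb{E}[U]|^2 \bigr] \\
&= \mathbb{E}\Bigl[ \Bigl| \sum_{k \in \mathbb{N}} u_k \Psi_k \Bigr|^2 \Bigr] \\
&= \sum_{k, \ell \in \mathbb{N}} u_k u_\ell \langle \Psi_k \Psi_\ell \rangle \\
&= \sum_{k \in \mathbb{N}} u_k^2 \langle \Psi_k^2 \rangle.
\end{align*}

Similar remarks apply to any truncation $U^K = \sum_{k=0}^{K} u_k \Psi_k$ of the gPC
expansion of $U$. In view of the expression for the variance, the gPC coefficients
can be used as sensitivity indices. That is, a natural measure of how strongly
$U$ depends upon $\Psi_k(\Xi)$ is

\[
\frac{u_k^2 \langle \Psi_k^2 \rangle}{\sum_{\ell \geq 1} u_\ell^2 \langle \Psi_\ell^2 \rangle}.
\]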
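In code, the mean, the variance and the normalized variance contributions of a truncated expansion follow directly from the coefficients and the squared norms $\langle \Psi_k^2 \rangle$. The sketch below is illustrative only: the function name, the example coefficients, and the reading of the sensitivity measure as each term's share of the total variance are assumptions. The squared norms are those of $\Psi_0, \dots, \Psi_5$ from Remark 11.15 (Hermite in $\xi_1$, Legendre in $\xi_2$).

```python
import numpy as np

def gpc_statistics(u, norms_sq):
    """Mean, variance and per-term variance fractions of a truncated gPC expansion.

    u[k] are the coefficients and norms_sq[k] = <Psi_k^2>, with Psi_0 = 1.
    The variance is assumed to be non-zero.
    """
    u = np.asarray(u, dtype=float)
    norms_sq = np.asarray(norms_sq, dtype=float)
    contributions = u[1:] ** 2 * norms_sq[1:]
    variance = contributions.sum()
    return u[0], variance, contributions / variance

norms_sq = [1.0, 1.0, 1.0 / 3.0, 2.0, 1.0 / 3.0, 1.0 / 5.0]
u = [1.0, 0.5, 2.0, 0.1, 0.0, 0.3]          # hypothetical coefficients
mean, variance, sensitivity = gpc_statistics(u, norms_sq)
```

Here the second basis function ($\Psi_2 = \xi_2$) accounts for most of the variance, even though its squared norm is smaller than that of $\Psi_1$.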

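Expansions of Random Vectors. Similarly, if $U_1, \dots, U_n$ are (not
necessarily independent) real-valued random variables, then the $\mathbb{R}^n$-valued random
variable $U = [U_1, \dots, U_n]^{\mathsf{T}}$ with the $U_i$ as its components can be given
a (possibly truncated) expansion

\[
U(\xi) = \sum_{k \in \mathbb{N}_0} u_k \Psi_k(\xi),
\]

with vector-valued gPC coefficients $u_k = [u_{1,k}, \dots, u_{n,k}]^{\mathsf{T}} \in \mathbb{R}^n$ for each
$k \in \mathbb{N}_0$. As before,

\[
\mathbb{E}[U] = \langle \Psi_0 U \rangle = \sum_{k \in \mathbb{N}_0} u_k \langle \Psi_0 \Psi_k \rangle = u_0 \in \mathbb{R}^n
\]

and the covariance matrix $C \in \mathbb{R}^{n \times n}$ of $U$ is given by

\[
C = \sum_{k \in \mathbb{N}} u_k u_k^{\mathsf{T}} \langle \Psi_k^2 \rangle,
\]

i.e. its components are $C_{ij} = \sum_{k \in \mathbb{N}} u_{i,k} u_{j,k} \langle \Psi_k^2 \rangle$.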
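The covariance formula is a weighted sum of outer products, which can be written compactly with `numpy.einsum`. In this sketch the array layout (row $k$ of `U` holds the vector coefficient $u_k$), the function name and the example values are assumptions made for illustration.

```python
import numpy as np

def gpc_mean_and_covariance(U, norms_sq):
    """Mean vector and covariance matrix of a vector-valued gPC expansion.

    U has shape (K + 1, n); norms_sq[k] = <Psi_k^2>, with Psi_0 = 1.
    """
    U = np.asarray(U, dtype=float)
    norms_sq = np.asarray(norms_sq, dtype=float)
    mean = U[0]
    # C_ij = sum_{k >= 1} <Psi_k^2> u_{i,k} u_{j,k}
    cov = np.einsum("k,ki,kj->ij", norms_sq[1:], U[1:], U[1:])
    return mean, cov

# Hypothetical example in the Hermite basis He_0, He_1, He_2 (squared norms 1, 1, 2)
U = [[1.0, 2.0], [1.0, 0.0], [0.5, 1.0]]
mean, cov = gpc_mean_and_covariance(U, [1.0, 1.0, 2.0])
```

By construction the result is symmetric and positive semi-definite, since each summand is a non-negative multiple of $u_k u_k^{\mathsf{T}}$.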

Expansions of Stochastic Processes. Consider now a stochastic process
$U$, i.e. a function $U \colon \Theta \times \mathcal{X} \to \mathbb{R}$. Suppose that $U$ is square integrable in
the sense that, for each $x \in \mathcal{X}$, $U(\cdot, x) \in L^2(\Theta, \mu)$ is a real-valued random
variable, and, for each $\theta \in \Theta$, $U(\theta, \cdot) \in L^2(\mathcal{X}, \mathrm{d}x)$ is a scalar field on the
domain $\mathcal{X}$. Recall that

\[
L^2(\Theta, \mu; \mathbb{R}) \otimes L^2(\mathcal{X}, \mathrm{d}x; \mathbb{R})
\cong L^2(\Theta \times \mathcal{X}, \mu \otimes \mathrm{d}x; \mathbb{R})
\cong L^2(\Theta, \mu; L^2(\mathcal{X}, \mathrm{d}x)),
\]

so $U$ can be equivalently viewed as a linear combination of products of
$\mathbb{R}$-valued random variables with deterministic scalar fields, or as a function
on $\Theta \times \mathcal{X}$, or as a field-valued random variable. As usual, take $\{\Psi_k \mid k \in \mathbb{N}_0\}$
to be an orthogonal polynomial basis of $L^2(\Theta, \mu; \mathbb{R})$, ordered (weakly) by
total degree, with $\Psi_0 = 1$. A gPC expansion of the random field $U$ is an
$L^2$-convergent expansion of the form

\[
U(x, \xi) = \sum_{k \in \mathbb{N}_0} u_k(x) \Psi_k(\xi).
\]

The functions $u_k \colon \mathcal{X} \to \mathbb{R}$ are called the stochastic modes of the process $U$.
The stochastic mode $u_0 \colon \mathcal{X} \to \mathbb{R}$ is the mean field of $U$:

\[
\mathbb{E}[U(x)] = u_0(x).
\]

The variance of the field at $x \in \mathcal{X}$ is

\[
\mathbb{V}[U(x)] = \sum_{k \in \mathbb{N}} u_k(x)^2 \langle \Psi_k^2 \rangle,
\]

whereas, for two points $x, y \in \mathcal{X}$,

\begin{align*}
\mathbb{E}[U(x) U(y)] &= \Bigl\langle \sum_{k \in \mathbb{N}_0} u_k(x) \Psi_k(\xi) \sum_{\ell \in \mathbb{N}_0} u_\ell(y) \Psi_\ell(\xi) \Bigr\rangle \\
&= \sum_{k \in \mathbb{N}_0} u_k(x) u_k(y) \langle \Psi_k^2 \rangle
\end{align*}

and so the covariance function of $U$ is given by

\[
C_U(x, y) = \sum_{k \in \mathbb{N}} u_k(x) u_k(y) \langle \Psi_k^2 \rangle.
\]

The previous remarks about gPC expansions of vector-valued random variables
are a special case of these remarks about stochastic processes, namely
$\mathcal{X} = \{1, \dots, n\}$. At least when $\dim \mathcal{X}$ is low, it is very common to see the
behaviour of a stochastic field $U$ (or its truncation $U^K$) summarized by plots
of the mean field and the variance field, as well as a few ‘typical’ sample
realizations. The visualization of high-dimensional data is a subject unto itself,
with many ingenious uses of shading, colour, transparency, videos and user
interaction tools.

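Changes of gPC Basis. It is possible to change between representations
of a stochastic quantity $U$ with respect to gPC bases $\{\Psi_k \mid k \in \mathbb{N}_0\}$ and
$\{\Phi_k \mid k \in \mathbb{N}_0\}$ generated by measures $\mu$ and $\nu$ respectively. Obviously, for
such changes of basis to work in both directions, $\mu$ and $\nu$ must at least have
the same support. Suppose that

\[
U = \sum_{k \in \mathbb{N}_0} u_k \Psi_k = \sum_{k \in \mathbb{N}_0} v_k \Phi_k.
\]

Then, taking the $L^2(\nu)$-inner product of this equation with $\Phi_\ell$,

\[
\langle U \Phi_\ell \rangle_\nu = \sum_{k \in \mathbb{N}_0} u_k \langle \Psi_k \Phi_\ell \rangle_\nu = v_\ell \langle \Phi_\ell^2 \rangle_\nu,
\]

provided that $\Psi_k \Phi_\ell \in L^2(\nu)$ for all $k \in \mathbb{N}_0$, i.e.

\[
v_\ell = \sum_{k \in \mathbb{N}_0} \frac{u_k \langle \Psi_k \Phi_\ell \rangle_\nu}{\langle \Phi_\ell^2 \rangle_\nu}.
\]

Similarly, taking the $L^2(\mu)$-inner product of this equation with $\Psi_\ell$ yields that,
provided that $\Phi_k \Psi_\ell \in L^2(\mu)$ for all $k \in \mathbb{N}_0$,

\[
u_\ell = \sum_{k \in \mathbb{N}_0} \frac{v_k \langle \Phi_k \Psi_\ell \rangle_\mu}{\langle \Psi_\ell^2 \rangle_\mu}.
\]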
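As a hedged illustration of the change-of-basis formula, the sketch below converts Legendre coefficients (orthogonal for the uniform measure on $[-1, 1]$) into Chebyshev coefficients (orthogonal for the measure with density proportional to $(1 - x^2)^{-1/2}$ on the same interval). The pairing of these two families is an assumption chosen because both measures share the support $[-1, 1]$; the function name and node count are likewise illustrative.

```python
import numpy as np
from numpy.polynomial import legendre as L, chebyshev as C

def legendre_to_chebyshev(u, n_quad=64):
    """Coefficients v_l of U in the Chebyshev basis, given Legendre coefficients u_k,
    via v_l = sum_k u_k <Psi_k Phi_l>_nu / <Phi_l^2>_nu."""
    u = np.asarray(u, dtype=float)
    x, w = C.chebgauss(n_quad)     # Gauss rule for the weight (1 - x^2)^(-1/2)
    w = w / np.pi                  # normalise nu to a probability measure
    K = len(u) - 1
    v = np.zeros(K + 1)
    for l in range(K + 1):
        phi_l = C.chebval(x, [0.0] * l + [1.0])
        num = sum(u[k] * np.sum(w * L.legval(x, [0.0] * k + [1.0]) * phi_l)
                  for k in range(K + 1))
        v[l] = num / np.sum(w * phi_l ** 2)
    return v

u = [1.0, 0.5, -0.25, 2.0]         # hypothetical Legendre coefficients
v = legendre_to_chebyshev(u)
```

Because a polynomial of degree $K$ lies in the span of the first $K + 1$ polynomials of either family, the two finite expansions represent exactly the same function here; for non-polynomial $U$ the sums over $k$ would be infinite and would need truncation.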


Remark 11.16. It is possible to adapt the notion of a gPC expansion to
the situation of a stochastic germ $\Xi$ with arbitrary dependencies among its
components, but there are some complications. In summary, suppose that
$\Xi = (\Xi_1, \dots, \Xi_d)$, taking values in $\Theta = \Theta_1 \times \dots \times \Theta_d$, has joint law $\mu$, which
is not necessarily a product measure. Nevertheless, let $\mu_i$ denote the marginal
law of $\Xi_i$, i.e.

\[
\mu_i(E_i) := \mu(\Theta_1 \times \dots \times \Theta_{i-1} \times E_i \times \Theta_{i+1} \times \dots \times \Theta_d).
\]

To simplify matters further, assume that $\mu$ (resp. $\mu_i$) has Lebesgue density
$\rho$ (resp. $\rho_i$). Now let $\varphi^{(i)}_p \in \mathfrak{P}$, $p \in \mathbb{N}_0$, be univariate orthogonal polynomials
for $\mu_i$. The chaos function associated with a multi-index $\alpha \in \mathbb{N}_0^d$ is defined
to be

\[
\Psi_\alpha(\xi) := \sqrt{\frac{\rho_1(\xi_1) \cdots \rho_d(\xi_d)}{\rho(\xi)}} \, \varphi^{(1)}_{\alpha_1}(\xi_1) \cdots \varphi^{(d)}_{\alpha_d}(\xi_d).
\]

It can be shown that the family $\{\Psi_\alpha \mid \alpha \in \mathbb{N}_0^d\}$ is a complete orthonormal
basis for $L^2(\Theta, \mu; \mathbb{R})$, so we have the usual series expansion $U = \sum_\alpha u_\alpha \Psi_\alpha$.
Note, however, that with the exception of $\Psi_0 = 1$, the functions $\Psi_\alpha$ are not
polynomials. Nevertheless, we still have the usual properties that truncation
error is orthogonal to the approximation subspace, and

\[
\mathbb{E}_\mu[U] = u_0, \qquad \mathbb{V}_\mu[U] = \sum_{\alpha \neq 0} u_\alpha^2.
\]

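Remark 11.17. Polynomial chaos expansions were originally introduced in
stochastic analysis, and in that setting the stochastic germ $\Xi$ typically has
countably infinite dimension, i.e. $\Xi = (\Xi_1, \dots, \Xi_d, \dots)$. Again, for simplicity,
suppose that the components of $\Xi$ are independent, and hence orthogonal;
let $\Theta$ denote the range of $\Xi$, which is an infinite product domain, and let
$\mu = \bigotimes_{d \in \mathbb{N}} \mu_d$ denote the law of $\Xi$. For each $d \in \mathbb{N}$, let $\{\psi^{(d)}_{\alpha_d} \mid \alpha_d \in \mathbb{N}_0\}$ be
a system of univariate orthogonal polynomials for $\Xi_d \sim \mu_d$, again with the
usual convention that $\psi^{(d)}_0 = 1$. Products of the form

\[
\psi_\alpha(\xi) := \prod_{d \in \mathbb{N}} \psi^{(d)}_{\alpha_d}(\xi_d)
\]

are again polynomials when only finitely many $\alpha_d \neq 0$, and form an
orthogonal system of polynomials in $L^2(\Theta, \mu; \mathbb{R})$.

As in the finite-dimensional case, there are many choices of ordering for the
basis polynomials, some of which may lend themselves to particular problems.
One possible orthogonal PC decomposition of $u(\Xi)$ for $u \in L^2(\Theta, \mu; \mathbb{R})$, in
which summands are arranged in order of increasing ‘complexity’, is

\begin{align*}
u(\Xi) = u_0 &+ \sum_{\substack{d \in \mathbb{N} \\ \alpha_d \in \mathbb{N}}} u_{\alpha_d} \psi^{(d)}_{\alpha_d}(\Xi_d)
+ \sum_{\substack{d_1, d_2 \in \mathbb{N} \\ \alpha_{d_1}, \alpha_{d_2} \in \mathbb{N}}} u_{\alpha_{d_1} \alpha_{d_2}} \psi^{(d_1)}_{\alpha_{d_1}}(\Xi_{d_1}) \psi^{(d_2)}_{\alpha_{d_2}}(\Xi_{d_2}) \\
&+ \dots + \sum_{\substack{d_1, d_2, \dots, d_k \in \mathbb{N} \\ \alpha_{d_1}, \dots, \alpha_{d_k} \in \mathbb{N}}} u_{\alpha_{d_1} \alpha_{d_2} \cdots \alpha_{d_k}} \psi^{(d_1)}_{\alpha_{d_1}}(\Xi_{d_1}) \cdots \psi^{(d_k)}_{\alpha_{d_k}}(\Xi_{d_k}) + \cdots,
\end{align*}

i.e., writing $\Psi^{(d)}_{\alpha_d}$ for the image random variable $\psi^{(d)}_{\alpha_d}(\Xi_d)$,

\begin{align*}
u = u_0 &+ \sum_{\substack{d \in \mathbb{N} \\ \alpha_d \in \mathbb{N}}} u_{\alpha_d} \Psi^{(d)}_{\alpha_d}
+ \sum_{\substack{d_1, d_2 \in \mathbb{N} \\ \alpha_{d_1}, \alpha_{d_2} \in \mathbb{N}}} u_{\alpha_{d_1} \alpha_{d_2}} \Psi^{(d_1)}_{\alpha_{d_1}} \Psi^{(d_2)}_{\alpha_{d_2}} \\
&+ \dots + \sum_{\substack{d_1, d_2, \dots, d_k \in \mathbb{N} \\ \alpha_{d_1}, \dots, \alpha_{d_k} \in \mathbb{N}}} u_{\alpha_{d_1} \alpha_{d_2} \cdots \alpha_{d_k}} \Psi^{(d_1)}_{\alpha_{d_1}} \Psi^{(d_2)}_{\alpha_{d_2}} \cdots \Psi^{(d_k)}_{\alpha_{d_k}} + \cdots.
\end{align*}

The PC coefficients $u_{\alpha_d} \in \mathbb{R}$, etc. are determined by the usual orthogonal
projection relation. In practice, this expansion must be terminated at finite
$k$, and provided that $u$ is square-integrable, the $L^2$ truncation error decays
to 0 as $k \to \infty$, with more rapid decay for smoother $u$, as in, e.g., Theorem
8.23.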


11.4 Wavelet Expansions 

Recall from the earlier discussion of Gibbs’ phenomenon in Chapter 8 that
expansions of non-smooth functions in terms of smooth basis functions such
as polynomials, while guaranteed to be convergent in the $L^2$ sense, can have
poor pointwise convergence properties. However, to remedy such problems,
one can consider spectral expansions in terms of orthogonal bases of functions
in $L^2(\Theta, \mu; \mathbb{R})$ that are no longer polynomials: a classic example of such a
construction is the use of wavelets, which were developed to resolve the same
problem in harmonic analysis and its applications. This section considers, by
way of example, orthogonal decomposition of random variables using Haar
wavelets, the so-called Wiener-Haar expansion.

Definition 11.18. The Haar scaling function is $\phi(x) := \mathbb{I}_{[0,1)}(x)$. For $j \in \mathbb{N}_0$
and $k \in \{0, \dots, 2^j - 1\}$, let $\phi_{j,k}(x) := 2^{j/2} \phi(2^j x - k)$ and

\[
V_j := \operatorname{span}\{\phi_{j,0}, \dots, \phi_{j,2^j - 1}\}.
\]

The Haar function (or Haar mother wavelet) $\psi \colon [0, 1] \to \mathbb{R}$ is defined by

\[
\psi(x) := \begin{cases}
1, & \text{if } 0 \leq x < \frac{1}{2}, \\
-1, & \text{if } \frac{1}{2} \leq x < 1, \\
0, & \text{otherwise.}
\end{cases}
\]

The Haar wavelet family is the collection of scaled and shifted versions $\psi_{j,k}$
of the mother wavelet $\psi$ defined by

\[
\psi_{j,k}(x) := 2^{j/2} \psi(2^j x - k) \quad \text{for } j \in \mathbb{N}_0 \text{ and } k \in \{0, \dots, 2^j - 1\}.
\]

The spaces $V_j$ form an increasing family of subspaces of $L^2([0, 1], \mathrm{d}x; \mathbb{R})$,
with the index $j$ representing the level of ‘detail’ permissible in a function
$f \in V_j$; more concretely, $V_j$ is the set of functions on $[0, 1]$ that are constant
on each half-open interval $[2^{-j} k, 2^{-j}(k + 1))$. A straightforward calculation
from the above definition yields the following:

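Lemma 11.19. For all $j, j' \in \mathbb{N}_0$, $k \in \{0, \dots, 2^j - 1\}$ and $k' \in \{0, \dots, 2^{j'} - 1\}$,

\[
\int_0^1 \psi_{j,k}(x) \, \mathrm{d}x = 0, \quad \text{and} \quad
\int_0^1 \psi_{j,k}(x) \psi_{j',k'}(x) \, \mathrm{d}x = \delta_{jj'} \delta_{kk'}.
\]

Hence, $\{1\} \cup \{\psi_{j,k} \mid j \in \mathbb{N}_0, k \in \{0, 1, \dots, 2^j - 1\}\}$ is a complete orthonormal
basis of $L^2([0, 1], \mathrm{d}x; \mathbb{R})$. If $W_j$ denotes the orthogonal complement of $V_j$ in
$V_{j+1}$, then

\[
W_j = \operatorname{span}\{\psi_{j,0}, \dots, \psi_{j,2^j - 1\}}, \quad \text{and} \quad
L^2([0, 1], \mathrm{d}x; \mathbb{R}) = V_0 \oplus \overline{\bigoplus_{j \in \mathbb{N}_0} W_j},
\]

where $V_0$ is the space of constant functions.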
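The orthonormality relations of Lemma 11.19 can be verified numerically. Since every Haar wavelet up to level $J$ is constant on dyadic cells of width $2^{-(J+1)}$, a midpoint rule on a finer dyadic grid integrates products of them exactly. The function names and grid size in this sketch are assumptions.

```python
import numpy as np

def haar_mother(x):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where((0 <= x) & (x < 0.5), 1.0,
                    np.where((0.5 <= x) & (x < 1.0), -1.0, 0.0))

def haar(j, k, x):
    """Scaled and shifted wavelet psi_{j,k}(x) = 2^{j/2} psi(2^j x - k)."""
    return 2.0 ** (j / 2) * haar_mother(2.0 ** j * np.asarray(x, dtype=float) - k)

def haar_gram(J, n_cells=2 ** 10):
    """Gram matrix of {1} and psi_{j,k}, j <= J, using a dyadic midpoint rule."""
    x = (np.arange(n_cells) + 0.5) / n_cells
    family = [np.ones_like(x)]
    family += [haar(j, k, x) for j in range(J + 1) for k in range(2 ** j)]
    B = np.array(family)
    return B @ B.T / n_cells

G = haar_gram(3)   # 1 + (1 + 2 + 4 + 8) = 16 basis functions
```

The matrix `G` should be the $16 \times 16$ identity, reflecting both the zero-mean property (orthogonality to the constant function) and the mutual orthonormality of the wavelets.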


Consider a stochastic germ $\Xi \sim \mu \in \mathcal{M}_1(\mathbb{R})$ with cumulative distribution
function $F_\Xi \colon \mathbb{R} \to [0, 1]$. For simplicity, suppose that $F_\Xi$ is continuous and
strictly increasing, so that $F_\Xi$ is differentiable (with $F_\Xi' = \frac{\mathrm{d}\mu}{\mathrm{d}\xi} = \rho_\Xi$) almost
everywhere, and also invertible. We wish to write a random variable $U \in
L^2(\mathbb{R}, \mu; \mathbb{R})$, in particular one that may be a non-smooth function of $\Xi$, as

\begin{align*}
U(\xi) &= u_0 + \sum_{j \in \mathbb{N}_0} \sum_{k=0}^{2^j - 1} u_{j,k} \psi_{j,k}(F_\Xi(\xi)) \\
&= u_0 + \sum_{j \in \mathbb{N}_0} \sum_{k=0}^{2^j - 1} u_{j,k} W_{j,k}(\xi);
\end{align*}

such an expansion will be called a Wiener-Haar expansion of $U$. See Figure
11.2 for an illustration comparing the cumulative distribution function of a
truncated Wiener-Haar expansion to that of a standard Gaussian, showing
the ‘clumping’ of probability mass that is to be expected of Wiener-Haar
wavelet expansions but not of Wiener-Hermite polynomial chaos expansions.
Indeed, the (sample) law of a Wiener-Haar expansion even has regions of
zero probability mass.

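Note that, by a straightforward change of variables $x = F_\Xi(\xi)$:

\begin{align*}
\int_{\mathbb{R}} W_{j,k}(\xi) W_{j',k'}(\xi) \, \mathrm{d}\mu(\xi)
&= \int_{\mathbb{R}} W_{j,k}(\xi) W_{j',k'}(\xi) \rho_\Xi(\xi) \, \mathrm{d}\xi \\
&= \int_0^1 \psi_{j,k}(x) \psi_{j',k'}(x) \, \mathrm{d}x \\
&= \delta_{jj'} \delta_{kk'},
\end{align*}

so the family $\{W_{j,k} \mid j \in \mathbb{N}_0, k \in \{0, \dots, 2^j - 1\}\}$ forms a complete
orthonormal basis for $L^2(\mathbb{R}, \mu; \mathbb{R})$. Hence, the Wiener-Haar coefficients are
determined by

\begin{align*}
u_{j,k} = \langle U W_{j,k} \rangle &= \int_{\mathbb{R}} U(\xi) W_{j,k}(\xi) \rho_\Xi(\xi) \, \mathrm{d}\xi \\
&= \int_0^1 U(F_\Xi^{-1}(x)) \psi_{j,k}(x) \, \mathrm{d}x.
\end{align*}

As in the case of a gPC expansion, the usual expressions for the mean and
variance of $U$ hold:

\[
\mathbb{E}[U] = u_0 \quad \text{and} \quad \mathbb{V}[U] = \sum_{j \in \mathbb{N}_0} \sum_{k=0}^{2^j - 1} |u_{j,k}|^2.
\]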
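A small numerical sketch of the coefficient formula $u_{j,k} = \int_0^1 U(F_\Xi^{-1}(x)) \psi_{j,k}(x) \, \mathrm{d}x$ follows, for a standard Gaussian germ and the discontinuous random variable $U = \mathbb{I}[\Xi > 0]$. The choice of $U$, the midpoint rule, and the function names are illustrative assumptions; the inverse Gaussian distribution function is taken from Python's standard `statistics.NormalDist`.

```python
import numpy as np
from statistics import NormalDist

def haar(j, k, x):
    """psi_{j,k}(x) = 2^{j/2} psi(2^j x - k) for the Haar mother wavelet psi."""
    y = 2.0 ** j * x - k
    up = ((0 <= y) & (y < 0.5)).astype(float)
    down = ((0.5 <= y) & (y < 1.0)).astype(float)
    return 2.0 ** (j / 2) * (up - down)

def wiener_haar_coefficients(U, inv_cdf, J, n_cells=2 ** 12):
    """Approximate u_0 and u_{j,k}, j <= J, by a midpoint rule on (0, 1)."""
    x = (np.arange(n_cells) + 0.5) / n_cells           # midpoints avoid 0 and 1
    g = np.array([U(inv_cdf(xi)) for xi in x])         # U(F^{-1}(x))
    u0 = g.mean()
    u = {(j, k): np.mean(g * haar(j, k, x))
         for j in range(J + 1) for k in range(2 ** j)}
    return u0, u

u0, u = wiener_haar_coefficients(lambda xi: float(xi > 0),
                                 NormalDist().inv_cdf, J=3)
variance = sum(c ** 2 for c in u.values())
```

For this step function the expansion terminates after a single wavelet: $u_0 = \frac{1}{2}$, $u_{0,0} = -\frac{1}{2}$ and all other coefficients vanish, giving variance $\frac{1}{4}$, whereas a Wiener-Hermite expansion of the same $U$ has infinitely many non-zero coefficients.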


Comparison of Wavelet and gPC Expansions. Despite the formal
similarities of the corresponding expansions, there are differences between wavelet
and gPC spectral expansions. For gPC expansions, the globally smooth
orthogonal polynomials used as the basis elements have the property that
expansions of smooth functions/random variables enjoy a fast convergence
rate, as in Theorem 8.23; no such connection between smoothness and
convergence rate is to be expected for Wiener-Haar expansions, in which the basis
functions are non-smooth. However, in cases in which $U$ shows a localized
sharp variation or a discontinuity, a Wiener-Haar expansion may be more
efficient than a gPC expansion, since the convergence rate of the latter would
be impaired by Gibbs-type phenomena. Another distinctive feature of the
Wiener-Haar expansion concerns products of piecewise constant processes.
For instance, for $f, g \in V_j$ the product $fg$ is again an element of $V_j$; it is
not true that the product of two polynomials of degree at most $n$ is again a
polynomial of degree at most $n$. Therefore, for problems with strong
dependence upon high-degree/high-detail features, or with multiplicative structure,
Wiener-Haar expansions may be more appropriate than gPC expansions.

Fig. 11.2 (panels: $J = 5$, $J = 6$, $J = 7$): The cumulative distribution function and binned
peak-normalized probability density function of $10^5$ i.i.d. samples of a random variable $U$
with truncated Wiener-Haar expansion $U = \sum_{j=0}^{J} \sum_{k=0}^{2^j - 1} u_{j,k} W_{j,k}(\Xi)$, where
$\Xi \sim \mathcal{N}(0, 1)$. The coefficients $u_{j,k}$ were sampled independently from $u_{j,k} \sim
2^{-j} \mathcal{N}(0, 1)$. The cumulative distribution function of a standard Gaussian is
shown dashed for comparison.

11.6 Exercises

Exercise 11.1. Consider the negative Laplacian operator $\mathcal{L} := -\frac{\mathrm{d}^2}{\mathrm{d}x^2}$ acting
on real-valued functions on the interval $[0, 1]$, with zero boundary conditions.
Show that the eigenvalues $\mu_n$ and normalized eigenfunctions $\psi_n$ of $\mathcal{L}$ are

\[
\mu_n = (\pi n)^2, \qquad \psi_n(x) = \sqrt{2} \sin(\pi n x).
\]

Hence show that $C := \mathcal{L}^{-1}$ has the same eigenfunctions with eigenvalues
$\lambda_n = (\pi n)^{-2}$. Hence, using the Karhunen-Loève theorem, generate figures
similar to Figure 11.1 for your choice of mean field $m \colon [0, 1] \to \mathbb{R}$.

Exercise 11.2. Do the analogue of Exercise 11.1 for $C = (-\Delta)^{-\alpha}$ acting on
real-valued functions on the square $[0, 1]^2$, again with zero boundary
conditions. Try $\alpha = 2$ first, then try $\alpha = 1$, and try coarser and finer meshes in
each case. You should see that your numerical draws from the Gaussian field
with $\alpha = 1$ fail to converge, whereas they converge nicely for $\alpha > 1$. Loosely
speaking, the reason for this is that a Gaussian random variable with
covariance $(-\Delta)^{-\alpha}$ is almost surely in the Sobolev space $H^s$ or the Hölder space
$C^s$ for $s < \alpha - \frac{d}{2}$, where $d$ is the spatial dimension; thus, $\alpha = 1$ on the
two-dimensional square is exactly on the borderline of divergence.

Exercise 11.3. Show that the eigenvalues $\lambda_n$ and eigenfunctions $e_n$ of the
exponential covariance function $C(x, y) = \exp(-|x - y|/a)$ on $[-b, b]$ are
given by

\[
\lambda_n = \begin{cases}
\dfrac{2a}{1 + a^2 w_n^2}, & \text{if } n \in 2\mathbb{Z}, \\[2ex]
\dfrac{2a}{1 + a^2 v_n^2}, & \text{if } n \in 2\mathbb{Z} + 1,
\end{cases}
\qquad
e_n(x) = \begin{cases}
\sin(w_n x) \Big/ \sqrt{b - \dfrac{\sin(2 w_n b)}{2 w_n}}, & \text{if } n \in 2\mathbb{Z}, \\[2ex]
\cos(v_n x) \Big/ \sqrt{b + \dfrac{\sin(2 v_n b)}{2 v_n}}, & \text{if } n \in 2\mathbb{Z} + 1,
\end{cases}
\]

where $w_n$ and $v_n$ solve the transcendental equations

\[
\begin{cases}
a w_n + \tan(w_n b) = 0, & \text{for } n \in 2\mathbb{Z}, \\
1 - a v_n \tan(v_n b) = 0, & \text{for } n \in 2\mathbb{Z} + 1.
\end{cases}
\]

Hence, using the Karhunen-Loève theorem, generate sample paths from the
Gaussian measure with covariance kernel $C$ and your choice of mean path.
Note that you will need to use a numerical method such as Newton’s method
to find approximate values for $w_n$ and $v_n$.

Exercise 11.4 (Karhunen-Loève-type sampling of Besov measures). Let
$\mathbb{T}^d := \mathbb{R}^d / \mathbb{Z}^d$ denote the $d$-dimensional unit torus. Let $\{\psi_\ell \mid \ell \in \mathbb{N}\}$ be an
orthonormal basis for $L^2(\mathbb{T}^d, \mathrm{d}x; \mathbb{R})$. Let $q \in [1, \infty)$ and $s \in (0, \infty)$, and define
a new norm $\|\cdot\|_{X^{s,q}}$ on series $u = \sum_{\ell \in \mathbb{N}} u_\ell \psi_\ell$ by

\[
\left\| \sum_{\ell \in \mathbb{N}} u_\ell \psi_\ell \right\|_{X^{s,q}}
:= \left( \sum_{\ell \in \mathbb{N}} \ell^{(sq/d + q/2 - 1)} |u_\ell|^q \right)^{1/q}.
\]

Show that $\|\cdot\|_{X^{s,q}}$ is indeed a norm and that the set of $u$ with $\|u\|_{X^{s,q}}$ finite
forms a Banach space. Now, for $q \in [1, \infty)$, $s > 0$, and $\kappa > 0$, define a random
function $U$ by

\[
U(x) := \sum_{\ell \in \mathbb{N}} \ell^{-\left(\frac{s}{d} + \frac{1}{2} - \frac{1}{q}\right)} \kappa^{-\frac{1}{q}} \, \Xi_\ell \psi_\ell(x),
\]

where the $\Xi_\ell$ are sampled independently and identically from the generalized
Gaussian measure on $\mathbb{R}$ with Lebesgue density proportional to $\exp(-\frac{1}{2}|\xi|^q)$.
By treating the above construction as an infinite product measure and
considering the product of the densities $\exp(-\frac{1}{2}|\xi_\ell|^q)$, show formally that $U$ has
‘Lebesgue density’ proportional to $\exp(-\frac{\kappa}{2}\|u\|^q_{X^{s,q}})$.

Generate sample realizations of $U$ and investigate the effect of the various
parameters $q$, $s$ and $\kappa$. It may be useful to know that samples from
the probability measure on $\mathbb{R}$ with Lebesgue density proportional to
$\exp(-\beta^{q/2} |x - m|^q)$ can be generated as
$m + \beta^{-1/2} S |Y|^{1/q}$, where $S$ is uniformly distributed in $\{-1, +1\}$ and $Y$ is
distributed according to the gamma distribution on $[0, \infty)$ with shape parameter $1/q$,
which has Lebesgue density $\frac{1}{\Gamma(1/q)} x^{1/q - 1} e^{-x} \mathbb{I}_{[0,\infty)}(x)$.


Chapter 12 

Stochastic Galerkin Methods 


Not to be absolutely certain is, I think, one 
of the essential things in rationality. 


Am I an Atheist or an Agnostic? 

Bertrand Russell 


Chapter 11 considered spectral expansions of square-integrable random
variables, random vectors and random fields of the form

\[
U = \sum_{k \in \mathbb{N}_0} u_k \Psi_k,
\]

where $U \in L^2(\Theta, \mu; \mathcal{U})$, $\mathcal{U}$ is a Hilbert space in which the corresponding
deterministic variables/vectors/fields lie, and $\{\Psi_k \mid k \in \mathbb{N}_0\}$ is some orthogonal
basis for $L^2(\Theta, \mu; \mathbb{R})$. However, beyond the standard Hilbert space orthogonal
projection relation

\[
u_k = \frac{\langle U \Psi_k \rangle}{\langle \Psi_k^2 \rangle},
\]

we know very little about how to solve for the stochastic modes $u_k \in \mathcal{U}$. For
example, if $U$ is the solution to a stochastic version of some problem such as
an ODE or PDE (e.g. with randomized coefficients), how are the stochastic
modes $u_k$ related to solutions of the original deterministic problem, or to the
stochastic modes of the random coefficients in the ODE/PDE? This chapter
and the next one focus on the determination of stochastic modes by two
classes of methods, the intrusive and the non-intrusive respectively.

This chapter considers intrusive spectral methods for UQ, and in particular
Galerkin methods. The Galerkin approach, also known as the Ritz-Galerkin
method or the method of mean weighted residuals, uses the formalism of weak
solutions, as expressed in terms of inner products, to form systems of
equations for the stochastic modes, which are generally coupled together. In terms
of practical implementation, this means that pre-existing numerical solution
schemes for the deterministic problem cannot be used as they are, and must
be coupled or otherwise modified to solve the stochastic problem. This
situation is the opposite of that in the next chapter: non-intrusive methods rely on
individual realizations to determine the stochastic model response to random
inputs, and hence can use a pre-existing deterministic solver ‘as is’.

Suppose that the model relationship between some input data $d$ and the
output (solution) $u$ can be expressed formally as

\[
\mathcal{R}(u; d) = 0, \tag{12.1}
\]

an equality in some normed vector space$^1$ $\mathcal{U}$. A weak interpretation of this
model relationship is that, for some collection of test functions $\mathcal{T} \subseteq \mathcal{U}'$,

\[
\langle \tau \mid \mathcal{R}(u; d) \rangle = 0 \quad \text{for all } \tau \in \mathcal{T}. \tag{12.2}
\]

Although it is clear that (12.1) $\implies$ (12.2), the converse implication is not
generally true, which is why (12.2) is known as a ‘weak’ interpretation of
(12.1). The weak formulation (12.2) is very attractive both for theory and for
practical implementation: in particular, the requirement that (12.2) should
hold only for $\tau$ in some basis of a finite-dimensional test space $\mathcal{T}$ lies at the
foundation of many numerical methods.

In this chapter, the input data and hence the sought-for solution are both
uncertain, and modelled as random variables. For simplicity, we shall restrict
attention to the $L^2$ case and assume that $U$ is a square-integrable $\mathcal{U}$-valued
random variable. Thus, throughout this chapter, $\mathcal{S} := L^2(\Theta, \mu; \mathbb{R})$ will denote
the stochastic part of the solution space, so that $U \in \mathcal{U} \otimes \mathcal{S}$. Furthermore,
given an orthogonal basis $\{\Psi_k \mid k \in \mathbb{N}_0\}$ of $\mathcal{S}$, we will take

\[
\mathcal{S}_K := \operatorname{span}\{\Psi_0, \dots, \Psi_K\}.
\]


12.1 Weak Formulation of Nonlinearities 

Nonlinearities of various types occur throughout UQ, and their treatment is
critical in the context of stochastic Galerkin methods, which require us to
approximate these nonlinearities within the finite-dimensional solution space
$\mathcal{S}_K$ or $\mathcal{U} \otimes \mathcal{S}_K$. Put another way, given gPC expansions for some random
variables, how can the gPC expansion of a nonlinear function of those
variables be calculated? What is the induced map from gPC coefficients to gPC
coefficients, i.e. what is the spectral representation of the nonlinearity?


1 Or, more generally, topological vector space.

For example, given an infinite or truncated gPC expansion

\[
U = \sum_{k \in \mathbb{N}_0} u_k \Psi_k,
\]

how does one calculate the gPC coefficients of, say, $U^2$ or $\sqrt{U}$ in terms of
those of $U$? The first example, $U^2$, is a special case of taking the product of
two gPC expansions:

Galerkin Multiplication. The first, simplest, kind of nonlinearity to
consider is the product of two or more random variables in terms of their gPC
expansions. The natural question to ask is how to quickly compute the gPC
coefficients of a product in terms of the gPC coefficients of the factors,
particularly if expansions are truncated to finite order.

Definition 12.1. Let $\{\Psi_k\}_{k \in \mathbb{N}_0}$ be an orthogonal set in $L^2(\Theta, \mu; \mathbb{R})$. The
associated multiplication tensor$^2$ (or Galerkin tensor) is the rank-3 tensor
$M_{ijk}$, $(i, j, k) \in \mathbb{N}_0^3$, defined by

\[
M_{ijk} := \frac{\langle \Psi_i \Psi_j \Psi_k \rangle}{\langle \Psi_k \Psi_k \rangle}
\]

whenever $\Psi_i \Psi_j \Psi_k$ is $\mu$-integrable. By mild abuse of notation, we also write
$M_{ijk}$ for the finite-dimensional rank-3 tensor defined by the same formula for
$0 \leq i, j, k \leq K$.

Remark 12.2. (a) The multiplication tensor is symmetric in the first
two indices (i.e. $M_{ijk} = M_{jik}$). In general, there are no symmetries
involving the third index.

(b) Furthermore, since $\{\Psi_k\}_{k \in \mathbb{N}_0}$ is an orthogonal system, many of the entries
of $M_{ijk}$ are zero, and so it is a sparse tensor.

(c) Note that the multiplication tensor is determined entirely by the gPC
basis $\{\Psi_k\}_{k \in \mathbb{N}_0}$ and the measure $\mu$, and so while there is a significant
computational cost associated with evaluating its entries, this is a one-time
cost: the multiplication tensor can be pre-computed, stored, and
then used for many different problems. In a few special cases, the
multiplication tensor can be calculated in closed form; see, e.g., Exercise
12.1. In other cases, it is necessary to resort to numerical integration;
note, however, that since each $\Psi_k$ is a polynomial, so is $\Psi_i \Psi_j \Psi_k$, and hence the
multiplication tensor can be evaluated numerically but exactly by Gauss
quadrature once the orthogonal polynomials of sufficiently high degree
and their zeros have been identified.
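As a concrete illustration of Remark 12.2(c), the following minimal sketch evaluates the multiplication tensor by Gauss quadrature. It assumes, purely for illustration, the probabilists' Hermite polynomials $He_k$ under the standard Gaussian measure; the function name `hermite_multiplication_tensor` is hypothetical and not part of the text.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermevander

def hermite_multiplication_tensor(K):
    """M[i, j, k] = <He_i He_j He_k> / <He_k^2> for 0 <= i, j, k <= K,
    with expectations taken under the standard Gaussian measure."""
    # The integrands have degree at most 3K, and an n-point Gauss rule is
    # exact for polynomials of degree at most 2n - 1.
    n = (3 * K) // 2 + 1
    nodes, weights = hermegauss(n)
    weights = weights / np.sqrt(2.0 * np.pi)   # normalize to a probability measure
    psi = hermevander(nodes, K)                # psi[q, k] = He_k(nodes[q])
    triple = np.einsum("q,qi,qj,qk->ijk", weights, psi, psi, psi)
    norms = np.einsum("q,qk,qk->k", weights, psi, psi)   # <He_k^2>
    return triple / norms                      # divides along the last index k

M = hermite_multiplication_tensor(3)
sparsity = np.count_nonzero(np.abs(M) > 1e-12) / M.size
```

The tensor depends only on the basis and the measure, so in practice it would be computed once and stored. The variable `sparsity` gives the fraction of non-zero entries, which illustrates Remark 12.2(b).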


2 Readers familiar with tensor notation from continuum mechanics or differential geometry
will see that $M_{ijk}$ is covariant in the indices $i$ and $j$ and contravariant in the index $k$, and
thus is a (2, 1)-tensor; therefore, if this text were following standard tensor algebra notation
and writing vectors as $u = u^k \Psi_k$, then the multiplication tensor would be denoted $M^k_{ij}$. In
terms of the dual basis $\{\Psi^k \mid k \in \mathbb{N}_0\}$ defined by $\langle \Psi^k \mid \Psi_j \rangle = \delta^k_j$, $M^k_{ij} = \langle \Psi^k \mid \Psi_i \Psi_j \rangle$.
Example 12.3. Suppose that $U = \sum_{k \in \mathbb{N}_0} u_k \Psi_k$ and $V = \sum_{k \in \mathbb{N}_0} v_k \Psi_k$ are
random variables in $\mathcal{S} := L^2(\Theta, \mu; \mathbb{R})$, with coefficients $u_k, v_k \in \mathbb{R}$. Suppose
that their product $W := UV$ is again a random variable in $\mathcal{S}$. The strong
form of this relationship is that $W = UV$ in $\mathcal{S}$, i.e.

\[ W(\theta) = U(\theta) V(\theta) \quad \text{for $\mu$-a.e. } \theta \in \Theta. \]

A weak interpretation, however, is that $W = UV$ holds only when tested
against the basis $\{\Psi_k\}_{k \in \mathbb{N}_0}$ of $\mathcal{S}$, and this leads to a method for determining
the coefficients $w_k$ in the expansion $W = \sum_{k \in \mathbb{N}_0} w_k \Psi_k$. Note that

\[ W = \sum_{i, j \in \mathbb{N}_0} u_i v_j \Psi_i \Psi_j, \]

so the coefficients $\{ w_k \mid k \in \mathbb{N}_0 \}$ are given by

\[ w_k = \frac{\langle W \Psi_k \rangle}{\langle \Psi_k^2 \rangle} = \sum_{i, j \in \mathbb{N}_0} M_{ijk} u_i v_j. \]

It is this formula that motivates the name multiplication tensor for $M_{ijk}$.
Now suppose that $U$ and $V$ in fact lie in $\mathcal{S}_K$, i.e. $U = \sum_{k=0}^{K} u_k \Psi_k$ and
$V = \sum_{k=0}^{K} v_k \Psi_k$. Then their product $W := UV$ has the expansion

\[ W = \sum_{k \in \mathbb{N}_0} \sum_{i, j = 0}^{K} M_{ijk} u_i v_j \Psi_k. \]

Note that, while $W$ lies in $L^2$, it is not necessarily in $\mathcal{S}_K$. Nevertheless, the
truncated expansion $\sum_{k=0}^{K} \sum_{i,j=0}^{K} M_{ijk} u_i v_j \Psi_k$ is the orthogonal projection of $W$
onto $\mathcal{S}_K$, and hence the $L^2$-closest approximation of $W$ in $\mathcal{S}_K$. It is called the
Galerkin product, or pseudo-spectral product, of $U$ and $V$, denoted $U *_K V$
or simply $U * V$ if it is not necessary to call attention to the order of the
truncation.

Remark 12.4. If $U, V \notin \mathcal{S}_K$, then we can have $U * V \neq \Pi_{\mathcal{S}_K}(UV)$.
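The formula $w_k = \sum_{i,j} M_{ijk} u_i v_j$ can be sketched in a few lines. The example below again assumes the Hermite basis with a standard Gaussian germ $\xi$ (an illustrative choice); `multiplication_tensor` and `galerkin_product` are hypothetical helper names.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermevander

def multiplication_tensor(K):
    """Hermite multiplication tensor M[i, j, k] by exact Gauss quadrature."""
    x, w = hermegauss((3 * K) // 2 + 1)
    w = w / np.sqrt(2.0 * np.pi)
    psi = hermevander(x, K)
    triple = np.einsum("q,qi,qj,qk->ijk", w, psi, psi, psi)
    return triple / np.einsum("q,qk,qk->k", w, psi, psi)

def galerkin_product(u, v, M):
    """Coefficients of U *_K V: w_k = sum_{i,j} M[i, j, k] u_i v_j."""
    return np.einsum("ijk,i,j->k", M, u, v)

K = 2
M = multiplication_tensor(K)

# U = 1 + xi, so U^2 = 1 + 2 xi + xi^2 = 2 He_0 + 2 He_1 + He_2 lies in S_2
# and the Galerkin product reproduces it without truncation error.
u = np.array([1.0, 1.0, 0.0])
w_exact = galerkin_product(u, u, M)

# U = He_2, so U^2 = He_4 + 4 He_2 + 2 He_0; the He_4 part is discarded.
h2 = np.array([0.0, 0.0, 1.0])
w_truncated = galerkin_product(h2, h2, M)
```

The second product shows the truncation at work: the result is the orthogonal projection of $U^2$ onto $\mathcal{S}_2$, not $U^2$ itself.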

The fact that multiplication of two random variables can be handled effi- 
ciently, albeit with some truncation error, in terms of their expansions in the 
gPC basis and the multiplication tensor is very useful, and is a good reason to 
pre-compute and store the multiplication tensor of a basis for use in multiple 
problems. 

Proposition 12.5. For fixed $K \in \mathbb{N}_0$, the Galerkin product satisfies, for all
$U, V, W \in \mathcal{S}_K$ and $\alpha, \beta \in \mathbb{R}$,

\[ U * V = \Pi_{\mathcal{S}_K}(UV), \]
\[ U * V = V * U, \]
\[ (\alpha U) * (\beta V) = \alpha \beta (U * V), \]
\[ (U + V) * W = U * W + V * W. \]

However, the Galerkin product is not associative, i.e. there can exist $U, V, W \in \mathcal{S}_K$
such that $U * (V * W) \neq (U * V) * W$.

Proof. Exercise 12.3. □ 

Outside the situation of binary products, Galerkin multiplication has
undesirable features that largely stem from the non-associativity property, which
in turn is a result of compounded truncation error from repeated orthogonal
projection into $\mathcal{S}_K$. As shown by Exercise 12.3, it is not even true that one
can make unambiguous sense of $U^n$ for $n \geq 4$!
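The ambiguity in higher powers can be seen in a very small example. The sketch below assumes the Hermite basis truncated at $K = 2$ and takes $U = He_1 = \xi$; both choices are illustrative assumptions rather than the setting of Exercise 12.3. It computes two bracketings of the fourth Galerkin power.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermevander

def multiplication_tensor(K):
    """Hermite multiplication tensor M[i, j, k] by exact Gauss quadrature."""
    x, w = hermegauss((3 * K) // 2 + 1)
    w = w / np.sqrt(2.0 * np.pi)
    psi = hermevander(x, K)
    triple = np.einsum("q,qi,qj,qk->ijk", w, psi, psi, psi)
    return triple / np.einsum("q,qk,qk->k", w, psi, psi)

M = multiplication_tensor(2)

def gp(a, b):
    """Galerkin product of two coefficient vectors in S_2."""
    return np.einsum("ijk,i,j->k", M, a, b)

u = np.array([0.0, 1.0, 0.0])       # U = He_1 = xi
nested = gp(u, gp(u, gp(u, u)))     # U * (U * (U * U))
paired = gp(gp(u, u), gp(u, u))     # (U * U) * (U * U)
```

Here `paired` agrees with the projection of $\xi^4 = He_4 + 6 He_2 + 3 He_0$ onto $\mathcal{S}_2$, whereas `nested` loses part of the $He_2$ coefficient because $U * (U * U)$ has already discarded a degree-3 term.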

For example, suppose that we wish to multiply three random variables
$U, V, W \in L^2(\Theta, \mu)$ in terms of their gPC expansions in a fashion similar
to the Galerkin product above. First of all, it must be acknowledged that
perhaps $Z := UVW \notin L^2(\Theta, \mu)$. Nevertheless, assuming that $Z$ is, after all,
square-integrable, a gPC expansion of the triple product is

\[ Z = \sum_{m \in \mathbb{N}_0} z_m \Psi_m = \sum_{m \in \mathbb{N}_0} \Biggl[ \sum_{j, k, \ell \in \mathbb{N}_0} T_{jk\ell m} u_j v_k w_\ell \Biggr] \Psi_m, \]

or an appropriate truncation of the same, where the rank-4 tensor $T_{jk\ell m}$ is
defined by

\[ T_{jk\ell m} := \frac{\langle \Psi_j \Psi_k \Psi_\ell \Psi_m \rangle}{\langle \Psi_m \Psi_m \rangle}. \]




This approach can be extended to higher-order multiplication. However, even
with sparsity, computation and storage of these tensors (which have $(K+1)^d$
entries when working with products of $d$ random variables to polynomial
degree $K$) quickly becomes prohibitively expensive. Therefore, it is
common to approximate the triple product in Galerkin fashion by two binary
products, i.e.

\[ UVW \approx U * (V * W). \]

Unfortunately, this approximation incurs additional truncation errors, since
each binary multiplication discards the part orthogonal to $\mathcal{S}_K$. Moreover, the terms
that are discarded depend upon the order of approximate multiplication and
truncation, and in general

\[ U * (V * W) \neq V * (W * U) \neq W * (U * V). \]
As a result, in general, higher-order Galerkin multiplication can fail to be
commutative if it is approached using binary multiplication; to restore
commutativity and a well-defined triple product, we must pay the price of working
with the larger tensor $T_{jk\ell m}$.
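The trade-off between the rank-4 tensor and repeated binary products can be checked numerically. The following sketch assumes the Hermite basis with $K = 2$ and the illustrative inputs $U = V = He_1$, $W = He_2$; the name `triple_product_tensor` is hypothetical. Since $He_0 = 1$, the rank-3 multiplication tensor is recovered as the slice `T[0]`.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermevander

def triple_product_tensor(K):
    """T[j, k, l, m] = <He_j He_k He_l He_m> / <He_m^2>, 0 <= j, k, l, m <= K."""
    x, w = hermegauss(2 * K + 1)       # exact up to degree 4K + 1 >= 4K
    w = w / np.sqrt(2.0 * np.pi)
    psi = hermevander(x, K)
    quad = np.einsum("q,qj,qk,ql,qm->jklm", w, psi, psi, psi, psi)
    return quad / np.einsum("q,qm,qm->m", w, psi, psi)

K = 2
T = triple_product_tensor(K)
M = T[0]                               # M[k, l, m], because He_0 = 1

def gp(a, b):
    """Binary Galerkin product in S_K."""
    return np.einsum("ijk,i,j->k", M, a, b)

u = np.array([0.0, 1.0, 0.0])          # U = He_1
v = np.array([0.0, 1.0, 0.0])          # V = He_1
w = np.array([0.0, 0.0, 1.0])          # W = He_2

z = np.einsum("jklm,j,k,l->m", T, u, v, w)   # projection of UVW onto S_K
order_1 = gp(u, gp(v, w))                    # U * (V * W)
order_2 = gp(w, gp(u, v))                    # W * (U * V)
```

In this example `order_2` happens to coincide with the projection `z`, because $U * V$ incurs no truncation error, while `order_1` does not.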

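Galerkin Inversion. After exponentiation to a positive integral power,
another common transformation that must be performed is to form the
reciprocal of a random variable: given

\[ U = \sum_{k \geq 0} u_k \Psi_k \approx \sum_{k=0}^{K} u_k \Psi_k \in \mathcal{S}_K, \]

we seek a random variable $V = \sum_{k \geq 0} v_k \Psi_k \approx \sum_{k=0}^{K} v_k \Psi_k$ such that
$U(\theta) V(\theta) = 1$ for almost every $\theta \in \Theta$. The weak interpretation in $\mathcal{S}_K$ of
this requirement is to find $V \in \mathcal{S}_K$ such that $U * V = \Psi_0$. Since $U * V$ has
as its $k$th gPC coefficient $\sum_{i,j=0}^{K} M_{ijk} u_i v_j$, we arrive at the following
matrix-vector equation for the gPC coefficients of $V$:

\[
\begin{bmatrix}
\sum_{i=0}^{K} M_{i00} u_i & \cdots & \sum_{i=0}^{K} M_{iK0} u_i \\
\sum_{i=0}^{K} M_{i01} u_i & \cdots & \sum_{i=0}^{K} M_{iK1} u_i \\
\vdots & \ddots & \vdots \\
\sum_{i=0}^{K} M_{i0K} u_i & \cdots & \sum_{i=0}^{K} M_{iKK} u_i
\end{bmatrix}
\begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_K \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
\tag{12.3}
\]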


Naturally, if $U(\theta) = 0$ for some $\theta$, then $V(\theta)$ will be undefined for that $\theta$.
Furthermore, if $U \approx 0$ with 'too large' probability, then $V$ may exist a.e.
but fail to be in $L^2$. Hence, it is not surprising to learn that while (12.3)
has a unique solution whenever the matrix on the left-hand side of (12.3) is
non-singular, the system becomes highly ill-conditioned as the amount of
probability mass near $U = 0$ increases.

In practice, it is essential to check the conditioning of the matrix on the 
left-hand side of (12.3), and to try several values of truncation order K , before 
placing any confidence in the results of a Galerkin inversion. Just as Remark 
9.16 highlighted the spurious ‘convergence’ of the Monte Carlo averages of 
the reciprocal of a Gaussian random variable, which in fact has no mean, 
Galerkin inversion can produce a 'formal' reciprocal for a random variable in
$\mathcal{S}_K$ that has no sensible reciprocal in $\mathcal{S}$. See Exercise 12.4 for an exploration
of this phenomenon in the Gaussian setting. 
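The linear system (12.3) and the recommended conditioning check can be sketched as follows. To have a random variable that is bounded away from zero, this example assumes a uniformly distributed germ $\xi$ on $[-1, 1]$ with the Legendre basis and $U = 2 + \xi$; these choices, and the helper names, are illustrative assumptions and differ from the Gaussian setting of Exercise 12.4.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, legvander, legval

def legendre_multiplication_tensor(K):
    """M[i, j, k] = <P_i P_j P_k> / <P_k^2> under the uniform law on [-1, 1]."""
    x, w = leggauss((3 * K) // 2 + 1)
    w = w / 2.0                            # uniform probability measure
    psi = legvander(x, K)
    triple = np.einsum("q,qi,qj,qk->ijk", w, psi, psi, psi)
    return triple / np.einsum("q,qk,qk->k", w, psi, psi)

def galerkin_reciprocal(u, M):
    """Solve (12.3) for the coefficients of V; also return the condition number."""
    lhs = np.einsum("ijk,i->kj", M, u)     # lhs[k, j] = sum_i M[i, j, k] u_i
    rhs = np.zeros(len(u))
    rhs[0] = 1.0                           # coefficients of Psi_0 = 1
    return np.linalg.solve(lhs, rhs), np.linalg.cond(lhs)

K = 12
M = legendre_multiplication_tensor(K)
u = np.zeros(K + 1)
u[0], u[1] = 2.0, 1.0                      # U = 2 + xi >= 1 almost surely
v, cond = galerkin_reciprocal(u, M)

xi = np.linspace(-1.0, 1.0, 201)
max_err = np.max(np.abs(legval(xi, v) - 1.0 / (2.0 + xi)))
```

Repeating the computation for several values of `K`, and for random variables with more probability mass near zero, while monitoring `cond`, is one simple way to follow the advice above before trusting the result.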

Similar ideas to those described above can be used to produce a Galerkin 
division algorithm for Galerkin gPC coefficients of U/V in terms of the gPC 
coefficients of U and V respectively; see Exercise 12.5. 

More General Nonlinearities. More general nonlinearities can be treated 
by the methods outlined above if one knows the Taylor expansion of the 
nonlinearity. The standard words of warning about compounded truncation 
error all apply, as do warnings about slowly convergent power series, which
necessitate very high order approximation of random variables in order to 
accurately resolve nonlinearities even at low order. 

Galerkin Formulation of Other Products. The methods described above 
for the multiplication of real- valued random variables can easily be extended 
to other settings, e.g. multiplication of random matrices of the appropriate 
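sizes. If

\[ A = \sum_{k=0}^{K} a_k \Psi_k \in L^2(\Theta, \mu; \mathbb{R}^{m \times n}) \cong \mathbb{R}^{m \times n} \otimes \mathcal{S}, \]
\[ B = \sum_{k=0}^{K} b_k \Psi_k \in L^2(\Theta, \mu; \mathbb{R}^{n \times p}) \cong \mathbb{R}^{n \times p} \otimes \mathcal{S} \]

are random matrices with coefficient matrices $a_k \in \mathbb{R}^{m \times n}$ and $b_k \in \mathbb{R}^{n \times p}$,
then their degree-$K$ Galerkin product is the random matrix

\[ C = \sum_{k=0}^{K} c_k \Psi_k \in L^2(\Theta, \mu; \mathbb{R}^{m \times p}) \cong \mathbb{R}^{m \times p} \otimes \mathcal{S} \]

with coefficient matrices $c_k \in \mathbb{R}^{m \times p}$ given by

\[ c_k = \sum_{i, j = 0}^{K} M_{ijk} a_i b_j. \]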

Similar ideas apply for operators, bilinear forms, etc., and are particularly 
useful in the Lax-Milgram theory of PDEs with uncertain coefficients, as 
considered later on in this chapter. 


12.2 Random Ordinary Differential Equations 

The Galerkin method is quite straightforward to apply to ordinary differential 
equations with uncertain coefficients, initial conditions, etc. that are modelled 
by random variables. Heuristically, the approach is as simple as multiplying
the ODE by a gPC basis element $\Psi_k$ and averaging; we consider some concrete
examples below. Simple examples such as these serve to illustrate one of the 
recurrent features of stochastic Galerkin methods, which is that the governing 
equations for the stochastic modes of the solutions are formally similar to 
the original deterministic problem, but generally couple together multiple 
instances of that problem in a non-trivial way. 

Example 12.6. Consider the linear first-order ordinary differential equation

\[ \dot{u}(t) = -\lambda u(t), \quad u(0) = b, \tag{12.4} \]

where $b, \lambda > 0$. This ODE arises frequently in the natural sciences, e.g. as a
simple model for the amount of radiation $u(t)$ emitted at time $t$ by a sample of
radioactive material with decay constant $\lambda$, i.e. half-life $\lambda^{-1} \log 2$; the initial
level of radiation emission at time $t = 0$ is $b$. Now suppose that the decay
constant and initial condition are not known perfectly, but can be described
by random variables $\Lambda, B \in L^2(\Theta, \mu; \mathbb{R})$ (both independent of time $t$), so that
the amount of radiation $U(t)$ emitted at time $t$ is now a random variable that
satisfies the random linear first-order ordinary differential equation

\[ \dot{U}(t) = -\Lambda U(t), \quad U(0) = B, \tag{12.5} \]

for square-integrable $U \colon [0, T] \times \Theta \to \mathbb{R}$, or, equivalently, $U \colon [0, T] \to L^2(\Theta, \mu; \mathbb{R})$.

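Let $\{\Psi_k\}_{k \in \mathbb{N}_0}$ be an orthogonal basis for $L^2(\Theta, \mu; \mathbb{R})$ with the usual
convention that $\Psi_0 = 1$. Suppose that our knowledge about $\Lambda$ and $B$ is encoded
in the gPC expansions $\Lambda = \sum_{k \in \mathbb{N}_0} \lambda_k \Psi_k$, $B = \sum_{k \in \mathbb{N}_0} b_k \Psi_k$; the aim is to
find the gPC expansion of $U(t) = \sum_{k \in \mathbb{N}_0} u_k(t) \Psi_k$. Projecting the evolution
equation (12.5) onto the basis $\{\Psi_k\}_{k \in \mathbb{N}_0}$ yields

\[ \langle \dot{U}(t) \Psi_k \rangle = -\langle \Lambda U(t) \Psi_k \rangle \quad \text{for each } k \in \mathbb{N}_0. \]

Inserting the gPC expansions for $\Lambda$ and $U$ into this yields, for every $k \in \mathbb{N}_0$,

\[ \Bigl\langle \sum_{j \in \mathbb{N}_0} \dot{u}_j(t) \Psi_j \Psi_k \Bigr\rangle = -\Bigl\langle \sum_{i \in \mathbb{N}_0} \lambda_i \Psi_i \sum_{j \in \mathbb{N}_0} u_j(t) \Psi_j \Psi_k \Bigr\rangle, \]
\[ \text{i.e.} \quad \dot{u}_k(t) \langle \Psi_k^2 \rangle = -\sum_{i, j \in \mathbb{N}_0} \lambda_i u_j(t) \langle \Psi_i \Psi_j \Psi_k \rangle, \]
\[ \text{i.e.} \quad \dot{u}_k(t) = -\sum_{i, j \in \mathbb{N}_0} M_{ijk} \lambda_i u_j(t). \]

The coefficients $u_k$ thus satisfy a coupled system of countably many ordinary
differential equations.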

If all the chaos expansions are truncated at order $K$, then all the above
summations over $\mathbb{N}_0$ become summations over $\{0, \dots, K\}$, yielding a
coupled system of $K + 1$ ordinary differential equations. In matrix-vector form,
the vector $u(t) \in \mathbb{R}^{K+1}$ of coefficients of the degree-$K$ Galerkin solution
$U^{(K)}(t) \in \mathcal{S}_K$ satisfies

\[ \dot{u}(t) = -A(\Lambda) u(t), \quad u(0) = b, \tag{12.6} \]

where the matrix $A(\Lambda) \in \mathbb{R}^{(K+1) \times (K+1)}$ has as its $(k, i)$th entry $\sum_{j=0}^{K} M_{ijk} \lambda_j$,
and $b = (b_0, \dots, b_K) \in \mathbb{R}^{K+1}$.

Note that the system (12.6) has the same form as the original deterministic
problem (12.4); however, since $A(\Lambda)$ is not generally diagonal, (12.6) consists
of $K + 1$ non-trivially coupled instances of the original problem (12.4), coupled
through the multiplication tensor and hence the matrix $A(\Lambda)$. In terms of the
pseudo-spectral product, (12.6) gives the evolution of the Galerkin solution
$U^{(K)}$ as

\[ \frac{\mathrm{d} U^{(K)}}{\mathrm{d} t}(t) = -(\Pi_{\mathcal{S}_K} \Lambda) * U^{(K)}(t), \quad U^{(K)}(0) = \Pi_{\mathcal{S}_K} B. \tag{12.7} \]

See Figure 12.1 for an illustration of the evolution of the solution to (12.6)
in the Hermite basis when $\log \Lambda, \log B \sim \mathcal{N}(0, 1)$ are independent. Recall
from Example 11.12 that, under such assumptions, $\Lambda$ and $B$ have Hermite
PC coefficients $\lambda_k = b_k = \sqrt{e} / k!$.

Fig. 12.1: The degree-10 Hermite PC Galerkin solution to the random ODE
(12.5), with log-normally distributed decay constant and initial condition.
The solid curve shows the mean of the solution, the dashed curves show the
higher-degree Hermite coefficients, and the grey envelope shows the mean
± one standard deviation. Note that, on these axes, only the coefficients of
degree $\leq 5$ are visible; the others are all of order $10^{-2}$ or smaller.

Note well that we do not claim that the Galerkin solution is the optimal
approximation in $\mathcal{S}_K$ to the true solution, i.e. we can have $U^{(K)} \neq \Pi_{\mathcal{S}_K} U$,
although Galerkin solutions can be seen as weighted projections. This is a
point that will be revisited in the more general context of Lax-Milgram 
theory. 
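The truncated system (12.6) of Example 12.6 is easy to assemble and integrate. The sketch below assumes the Hermite basis and uses the coefficients $\lambda_k = b_k = \sqrt{e}/k!$ quoted above; the classical fourth-order Runge-Kutta integrator, the step count and the helper names are illustrative choices, not taken from the text.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, hermevander

def multiplication_tensor(K):
    """Hermite multiplication tensor M[i, j, k] by exact Gauss quadrature."""
    x, w = hermegauss((3 * K) // 2 + 1)
    w = w / np.sqrt(2.0 * np.pi)
    psi = hermevander(x, K)
    triple = np.einsum("q,qi,qj,qk->ijk", w, psi, psi, psi)
    return triple / np.einsum("q,qk,qk->k", w, psi, psi)

def decay_matrix(lam, M):
    """A[k, i] = sum_j M[i, j, k] lam_j, the matrix A(Lambda) in (12.6)."""
    return np.einsum("ijk,j->ki", M, lam)

def integrate(A, b, T, steps):
    """Classical RK4 for du/dt = -A u with u(0) = b; returns u(T)."""
    dt = T / steps
    u = np.array(b, dtype=float)
    for _ in range(steps):
        k1 = -A @ u
        k2 = -A @ (u + 0.5 * dt * k1)
        k3 = -A @ (u + 0.5 * dt * k2)
        k4 = -A @ (u + dt * k3)
        u = u + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return u

K = 10
M = multiplication_tensor(K)
lognormal = np.array([np.exp(0.5) / factorial(k) for k in range(K + 1)])
u_T = integrate(decay_matrix(lognormal, M), lognormal, T=1.0, steps=1000)
mean_T = u_T[0]      # the coefficient of Psi_0 = 1 is the mean of U^(K)(T)
```

When $\Lambda$ is deterministic, only $\lambda_0$ is non-zero, $A(\Lambda)$ reduces to $\lambda_0$ times the identity, and the modes decouple into $K + 1$ independent copies of (12.4).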

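Example 12.7. Consider the simple harmonic oscillator equation

\[ \ddot{U}(t) = -\Omega^2 U(t). \tag{12.8} \]

For simplicity, suppose that the initial conditions $U(0) = 1$ and $\dot{U}(0) = 0$
are known, but that $\Omega$ is stochastic. Let $\{\Psi_k\}_{k \in \mathbb{N}_0}$ be an orthogonal basis for
$L^2(\Theta, \mu; \mathbb{R})$ with the usual convention that $\Psi_0 = 1$. Suppose that $\Omega$ has a
gPC expansion $\Omega = \sum_{k \in \mathbb{N}_0} \omega_k \Psi_k$ and it is desired to find the gPC expansion
of $U$, i.e. $U(t) = \sum_{k \in \mathbb{N}_0} u_k(t) \Psi_k$. Note that the random variable $Y := \Omega^2$ has
a gPC expansion $Y = \sum_{k \in \mathbb{N}_0} y_k \Psi_k$ with

\[ y_k = \sum_{i, j \in \mathbb{N}_0} M_{ijk} \omega_i \omega_j. \]

Projecting the evolution equation (12.8) onto the basis $\{\Psi_k\}_{k \in \mathbb{N}_0}$ yields

\[ \langle \ddot{U}(t) \Psi_k \rangle = -\langle Y U(t) \Psi_k \rangle \quad \text{for each } k \in \mathbb{N}_0. \]

Inserting the chaos expansions for $Y$ and $U$ into this yields, for every $k \in \mathbb{N}_0$,

\[ \Bigl\langle \sum_{i \in \mathbb{N}_0} \ddot{u}_i(t) \Psi_i \Psi_k \Bigr\rangle = -\Bigl\langle \sum_{j \in \mathbb{N}_0} y_j \Psi_j \sum_{i \in \mathbb{N}_0} u_i(t) \Psi_i \Psi_k \Bigr\rangle, \]
\[ \text{i.e.} \quad \ddot{u}_k(t) \langle \Psi_k^2 \rangle = -\sum_{i, j \in \mathbb{N}_0} y_j u_i(t) \langle \Psi_i \Psi_j \Psi_k \rangle, \]
\[ \text{i.e.} \quad \ddot{u}_k(t) = -\sum_{i, j \in \mathbb{N}_0} M_{ijk} y_j u_i(t). \]

If all these gPC expansions are truncated at order $K$, and $A \in \mathbb{R}^{(K+1) \times (K+1)}$
is defined by

\[ A_{ik} := \sum_{j=0}^{K} M_{ijk} y_j = \sum_{j, p, q = 0}^{K} M_{ijk} M_{pqj} \omega_p \omega_q, \]

then the vector $u(t)$ of coefficients for the degree-$K$ Galerkin solution $U^{(K)}(t)$
satisfies the vector oscillator equation

\[ \ddot{u}(t) = -A^\top u(t) \tag{12.9} \]

with the obvious initial conditions.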

See Figure 12.2 for illustrations of the solution to the Galerkin problem
(12.9) when the Hermite basis is used and $\Omega$ is log-normally distributed with
$\log \Omega \sim \mathcal{N}(0, \sigma^2)$ for various values of $\sigma > 0$. Recall from Example 11.12
that the Hermite coefficients of such a log-normal $\Omega$ are $\omega_k = e^{\sigma^2/2} \sigma^k / k!$.
For these illustrations, the ODE (12.9) is integrated using the symplectic
(energy-conserving) semi-implicit Euler method

\[ u(t + \Delta t) = u(t) + v(t + \Delta t) \Delta t, \]
\[ v(t + \Delta t) = v(t) - A^\top u(t) \Delta t, \]

where $v = \dot{u}$, which has a global error of order $\Delta t$.
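The semi-implicit Euler scheme for (12.9) can be sketched as follows, assuming the Hermite basis and the log-normal coefficients $\omega_k = e^{\sigma^2/2} \sigma^k / k!$ from above. The value $\sigma = 0.1$, the step size and the function names are illustrative assumptions and are not the values used for Figure 12.2.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, hermevander

def multiplication_tensor(K):
    """Hermite multiplication tensor M[i, j, k] by exact Gauss quadrature."""
    x, w = hermegauss((3 * K) // 2 + 1)
    w = w / np.sqrt(2.0 * np.pi)
    psi = hermevander(x, K)
    triple = np.einsum("q,qi,qj,qk->ijk", w, psi, psi, psi)
    return triple / np.einsum("q,qk,qk->k", w, psi, psi)

def oscillator_matrix(omega, M):
    """A[i, k] = sum_j M[i, j, k] y_j, where y holds the Galerkin square of Omega."""
    y = np.einsum("ijk,i,j->k", M, omega, omega)
    return np.einsum("ijk,j->ik", M, y)

def symplectic_euler(A, u0, v0, dt, steps):
    """Semi-implicit Euler: update the velocity first, then the position."""
    u, v = np.array(u0, dtype=float), np.array(v0, dtype=float)
    for _ in range(steps):
        v = v - dt * (A.T @ u)      # v(t + dt) = v(t) - A^T u(t) dt
        u = u + dt * v              # u(t + dt) = u(t) + v(t + dt) dt
    return u, v

K, sigma = 10, 0.1
M = multiplication_tensor(K)
omega = np.array([np.exp(0.5 * sigma**2) * sigma**k / factorial(k)
                  for k in range(K + 1)])
u0 = np.zeros(K + 1)
u0[0] = 1.0                          # U(0) = 1 is deterministic
v0 = np.zeros(K + 1)                 # dU/dt(0) = 0
u_T, v_T = symplectic_euler(oscillator_matrix(omega, M), u0, v0, dt=1e-3, steps=5000)
```

For a deterministic $\Omega = 1$ the matrix $A$ is the identity and the zeroth mode reduces to $\cos t$, which gives a simple sanity check on the integrator.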
Fig. 12.2: The degree-10 Hermite PC Galerkin solution to the simple
harmonic oscillator equation of Example 12.7 with log-normally distributed
angular velocity $\Omega$, $\log \Omega \sim \mathcal{N}(0, \sigma^2)$. The solid curve shows the mean of the
solution, the dashed curves show the higher-degree Hermite coefficients, and
the grey envelope shows the mean ± one standard deviation. In the case
a = yq, the variance grows so quickly that accurate predictions of the sys- 
tem’s state after just one or two cycles are essentially impossible. 
12.3 Lax-Milgram Theory and Random PDEs


The Galerkin method lies at the heart of modern methods for the
analytical treatment and numerical solution of PDEs. Furthermore, when those
PDEs have uncertain data (e.g. uncertain coefficients, or uncertain initial
or boundary conditions), we have the possibility of a 'double Galerkin'
approach, using the notion of a weak solution over both the deterministic and
the stochastic spaces. This section covers the deterministic picture first, and
the following section covers the stochastic case, and discusses the coupling
phenomena that have already been discussed for ODEs above.

The abstract weak formulation of many PDEs is that, given a real Hilbert
space $\mathcal{H}$ equipped with a bilinear form $a \colon \mathcal{H} \times \mathcal{H} \to \mathbb{R}$, and $f \in \mathcal{H}'$ (i.e. a
continuous linear functional $f \colon \mathcal{H} \to \mathbb{R}$), we seek

\[ u \in \mathcal{H} \text{ such that } a(u, v) = \langle f \mid v \rangle \text{ for all } v \in \mathcal{H}. \tag{12.10} \]

Such a $u$ is called a weak solution, and (12.10) is called the weak problem.
The cardinal example of this setup is an elliptic boundary value problem:

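Example 12.8. Let $\mathcal{X} \subseteq \mathbb{R}^n$ be a bounded, connected domain. Let a
matrix-valued function $\kappa \colon \mathcal{X} \to \mathbb{R}^{n \times n}$ and a scalar-valued function $f \colon \mathcal{X} \to \mathbb{R}$ be
given, and consider the elliptic problem

\[ -\nabla \cdot (\kappa(x) \nabla u(x)) = f(x) \quad \text{for } x \in \mathcal{X}, \tag{12.11} \]
\[ u(x) = 0 \quad \text{for } x \in \partial \mathcal{X}. \]

The appropriate bilinear form $a(\cdot, \cdot)$ is defined by

\[ a(u, v) := \langle -\nabla \cdot (\kappa \nabla u), v \rangle_{L^2(\mathcal{X})} = \langle \kappa \nabla u, \nabla v \rangle_{L^2(\mathcal{X})}, \]

where the second equality follows from integration by parts when $u, v$ are
smooth functions that vanish on $\partial \mathcal{X}$; such functions form a dense subset of
the Sobolev space $H_0^1(\mathcal{X})$. This short calculation motivates two important
developments in the treatment of the PDE (12.11). First, even though the
original formulation (12.11) seems to require the solution $u$ to have two orders
of differentiability, the last line of the above calculation makes sense even if
$u$ and $v$ have only one order of (weak) differentiability, and so we restrict
attention to $H_0^1(\mathcal{X})$. Second, we declare $u \in H_0^1(\mathcal{X})$ to be a weak solution of
(12.11) if the $L^2(\mathcal{X})$ inner product of (12.11) with any $v \in H_0^1(\mathcal{X})$ holds as
an equality of real numbers, i.e. if

\[ -\int_{\mathcal{X}} \nabla \cdot (\kappa(x) \nabla u(x)) v(x) \, \mathrm{d}x = \int_{\mathcal{X}} f(x) v(x) \, \mathrm{d}x \quad \text{for all } v \in H_0^1(\mathcal{X}), \]

i.e. if

\[ a(u, v) = \langle f, v \rangle_{L^2(\mathcal{X})} \quad \text{for all } v \in H_0^1(\mathcal{X}), \]

which is a special case of (12.10).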
The existence and uniqueness of solutions to problems like (12.10), under
appropriate conditions on $a$ (which of course are inherited from appropriate
conditions on $\kappa$), is ensured by the Lax-Milgram theorem, which generalizes
the Riesz representation theorem that any Hilbert space is isomorphic to its
dual space.

Theorem 12.9 (Lax-Milgram). Let $a$ be a bilinear form on a Hilbert space
$\mathcal{H}$, i.e. $a \in \mathcal{H}' \otimes \mathcal{H}'$, such that

(a) (boundedness) there exists a constant $C > 0$ such that, for all $u, v \in \mathcal{H}$,
$|a(u, v)| \leq C \|u\| \|v\|$; and

(b) (coercivity) there exists a constant $c > 0$ such that, for all $v \in \mathcal{H}$,
$|a(v, v)| \geq c \|v\|^2$.

Then, for all $f \in \mathcal{H}'$, there exists a unique $u \in \mathcal{H}$ such that, for all $v \in \mathcal{H}$,
$a(u, v) = \langle f \mid v \rangle$. Furthermore, $u$ satisfies the estimate $\|u\|_{\mathcal{H}} \leq c^{-1} \|f\|_{\mathcal{H}'}$.

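Proof. For each $u \in \mathcal{H}$, $v \mapsto a(u, v)$ is a bounded linear functional on $\mathcal{H}$. So,
by the Riesz representation theorem (Theorem 3.15), given $u \in \mathcal{H}$, there is
a unique $w \in \mathcal{H}$ such that $\langle w, \cdot \rangle = a(u, \cdot)$. Define $Au := w$. This defines a
well-defined function $A \colon \mathcal{H} \to \mathcal{H}$, the properties of which we now check:

(a) $A$ is linear. Let $\alpha_1$ and $\alpha_2$ be scalars and let $u_1, u_2 \in \mathcal{H}$. Then

\[ \langle A(\alpha_1 u_1 + \alpha_2 u_2), v \rangle = a(\alpha_1 u_1 + \alpha_2 u_2, v) \]
\[ = \alpha_1 a(u_1, v) + \alpha_2 a(u_2, v) \]
\[ = \alpha_1 \langle A u_1, v \rangle + \alpha_2 \langle A u_2, v \rangle \]
\[ = \langle \alpha_1 A u_1 + \alpha_2 A u_2, v \rangle. \]

(b) $A$ is a bounded (i.e. continuous) map, since, for any $u \in \mathcal{H}$,

\[ \|Au\|^2 = \langle Au, Au \rangle = a(u, Au) \leq C \|u\| \|Au\|, \]

so $\|Au\| \leq C \|u\|$.

(c) $A$ is injective, since, for any $u \in \mathcal{H}$,

\[ \|Au\| \|u\| \geq |\langle Au, u \rangle| = |a(u, u)| \geq c \|u\|^2, \]

so $Au = 0 \implies u = 0$.

(d) The range of $A$, $\operatorname{ran} A \subseteq \mathcal{H}$, is closed. Consider a convergent sequence
$(v_n)_{n \in \mathbb{N}}$ in $\operatorname{ran} A$ that converges to some $v \in \mathcal{H}$. Choose $u_n \in \mathcal{H}$ such
that $A u_n = v_n$ for each $n \in \mathbb{N}$. The sequence $(A u_n)_{n \in \mathbb{N}}$ is Cauchy, and

\[ \|A u_n - A u_m\| \|u_n - u_m\| \geq |\langle A u_n - A u_m, u_n - u_m \rangle| \]
\[ = |a(u_n - u_m, u_n - u_m)| \]
\[ \geq c \|u_n - u_m\|^2. \]

So $c \|u_n - u_m\| \leq \|v_n - v_m\| \to 0$. So $(u_n)_{n \in \mathbb{N}}$ is Cauchy and converges to
some $u \in \mathcal{H}$. So $v_n = A u_n \to A u = v$ by the continuity (boundedness) of
$A$, so $v \in \operatorname{ran} A$, and so $\operatorname{ran} A$ is closed.

(e) Finally, $A$ is surjective. Since $\mathcal{H}$ is Hilbert and $\operatorname{ran} A$ is closed, if $\operatorname{ran} A \neq \mathcal{H}$,
then there must exist some non-zero $s \in \mathcal{H}$ such that $s \perp \operatorname{ran} A$. But
then

\[ c \|s\|^2 \leq |a(s, s)| = |\langle s, As \rangle| = 0, \]

so $s = 0$, a contradiction.

Now, to summarize, take $f \in \mathcal{H}'$. By the Riesz representation theorem,
there is a unique $w \in \mathcal{H}$ such that $\langle w, v \rangle = \langle f \mid v \rangle$ for all $v \in \mathcal{H}$. Since
$A$ is invertible, the equation $Au = w$ has a unique solution $u \in \mathcal{H}$. Thus,
$\langle Au, v \rangle = \langle f \mid v \rangle$ for all $v \in \mathcal{H}$. But $\langle Au, v \rangle = a(u, v)$. So there is a unique
$u \in \mathcal{H}$ such that $a(u, v) = \langle f \mid v \rangle$ for all $v \in \mathcal{H}$.

The proof of the estimate $\|u\|_{\mathcal{H}} \leq c^{-1} \|f\|_{\mathcal{H}'}$ is left as an exercise
(Exercise 12.9). □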
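In finite dimensions the theorem reduces to a statement about matrices, which gives a quick numerical sanity check of the stability estimate. The sketch below takes $\mathcal{H} = \mathbb{R}^n$ with the Euclidean inner product and $a(u, v) = v^\top G u$ for a randomly generated, non-symmetric matrix $G$; the matrix, the seed and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
S = rng.standard_normal((n, n))
# Symmetric part S S^T + n I is positive definite; S - S^T is skew-symmetric,
# so it changes a(u, v) but not a(v, v).
G = S @ S.T + n * np.eye(n) + (S - S.T)

c = np.linalg.eigvalsh(0.5 * (G + G.T)).min()   # coercivity constant
C = np.linalg.norm(G, 2)                         # boundedness constant

f = rng.standard_normal(n)
u = np.linalg.solve(G, f)                        # a(u, v) = <f | v> for all v
bound = np.linalg.norm(f) / c                    # the estimate ||u|| <= ||f|| / c
```

Here coercivity only involves the symmetric part of $G$, which is why a bilinear form that is far from symmetric can still satisfy the hypotheses.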


Galerkin Projection. Now consider the problem of finding a good
approximation to $u$ in a prescribed subspace $\mathcal{U}_M \subseteq \mathcal{H}$ of finite dimension³, as
we must necessarily do when working discretely on a computer. We could,
of course, consider the optimal approximation to $u$ in $\mathcal{U}_M$, namely the
orthogonal projection of $u$ onto $\mathcal{U}_M$. However, since $u$ is not known a priori,
and in any case cannot be stored to arbitrary precision on a computer, this
'optimal' approximation is not much use in practice.

An alternative approach to approximating $u$ is Galerkin projection: we
seek a Galerkin solution $u \approx u^{(M)} \in \mathcal{U}_M$, an approximation to the exact
solution $u$, such that

\[ a(u^{(M)}, v^{(M)}) = \langle f \mid v^{(M)} \rangle \quad \text{for all } v^{(M)} \in \mathcal{U}_M. \tag{12.12} \]

Note that if the hypotheses of the Lax-Milgram theorem are satisfied on the
full space $\mathcal{H}$, then they are certainly satisfied on the subspace $\mathcal{U}_M$, thereby
ensuring the existence and uniqueness of solutions to the Galerkin problem.
Note well, though, that existence of a unique Galerkin solution for each $M \in \mathbb{N}_0$
does not imply the existence of a unique weak solution (nor even multiple
weak solutions) to the full problem; for this, one typically needs to show that
the Galerkin approximations are uniformly bounded and appeal to a Sobolev
embedding theorem to extract a convergent subsequence.

3 Usually, but not always, the convention will be that $\dim \mathcal{U}_M = M$; sometimes, alternative
conventions will be followed.

Example 12.10. (a) The Fourier basis $\{e_k\}_{k \in \mathbb{Z}}$ of $L^2_{\mathrm{per}}([0, 2\pi], \mathrm{d}x; \mathbb{C})$, the
space of complex-valued $2\pi$-periodic functions on $[0, 2\pi]$, is defined by

\[ e_k(x) := \frac{1}{\sqrt{2\pi}} \exp(ikx). \]

For Galerkin projection, one can use the $(2M + 1)$-dimensional subspace

\[ \mathcal{U}_M := \operatorname{span}\{e_{-M}, \dots, e_{-1}, e_0, e_1, \dots, e_M\} \]

of functions that are band-limited to contain frequencies at most $M$. In
case of real-valued functions, one can use the functions

\[ x \mapsto \cos(kx), \quad \text{for } k \in \mathbb{N}_0, \]
\[ x \mapsto \sin(kx), \quad \text{for } k \in \mathbb{N}. \]

(b) Fix a partition $a = x_0 < x_1 < \dots < x_M = b$ of a compact interval
$[a, b] \subseteq \mathbb{R}$ and consider the associated tent functions defined by

\[
\phi_m(x) :=
\begin{cases}
0, & \text{if } x \leq a \text{ or } x \leq x_{m-1}; \\
\dfrac{x - x_{m-1}}{x_m - x_{m-1}}, & \text{if } x_{m-1} \leq x \leq x_m; \\
\dfrac{x_{m+1} - x}{x_{m+1} - x_m}, & \text{if } x_m \leq x \leq x_{m+1}; \\
0, & \text{if } x \geq b \text{ or } x \geq x_{m+1}.
\end{cases}
\]

The function $\phi_m$ takes the value 1 at $x_m$ and decays linearly to 0 along the
two line segments adjacent to $x_m$. The $(M + 1)$-dimensional vector space
$\mathcal{U}_M := \operatorname{span}\{\phi_0, \dots, \phi_M\}$ consists of all continuous functions on $[a, b]$ that
are piecewise affine on the partition, i.e. have constant derivative on each
of the open intervals $(x_{m-1}, x_m)$. The space $\tilde{\mathcal{U}}_M := \operatorname{span}\{\phi_1, \dots, \phi_{M-1}\}$
consists of the continuous functions that are piecewise affine on the partition
and take the value 0 at $a$ and $b$; hence, $\tilde{\mathcal{U}}_M$ is one good choice for a
finite-dimensional space to approximate the Sobolev space $H_0^1([a, b])$. More
generally, one could consider tent functions associated with any simplicial
mesh in $\mathbb{R}^n$.
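The tent functions of Example 12.10(b) are simple to evaluate numerically. The sketch below uses `numpy.interp`, which performs piecewise-linear interpolation, on an arbitrary illustrative partition of $[0, 1]$; the function name `tent` is hypothetical.

```python
import numpy as np

def tent(m, nodes, x):
    """Evaluate phi_m at x: piecewise affine on the partition `nodes`,
    equal to 1 at nodes[m] and 0 at every other node."""
    values = np.zeros(len(nodes))
    values[m] = 1.0
    return np.interp(x, nodes, values)

nodes = np.array([0.0, 0.2, 0.5, 0.6, 1.0])     # a = x_0 < ... < x_M = b
x = np.linspace(0.0, 1.0, 101)
Phi = np.array([tent(m, nodes, x) for m in range(len(nodes))])

# Any function in span{phi_0, ..., phi_M} is determined by its nodal values:
g = np.sin(np.pi * nodes)
g_interp = g @ Phi                               # sum_m g(x_m) phi_m(x)
```

Dropping the first and last rows of `Phi` gives the functions that vanish at both endpoints, i.e. the basis of the smaller space used to approximate $H_0^1([a, b])$.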

Another viewpoint on the Galerkin solution $u^{(M)}$ is to see it as the
projection $P\tilde{u}$ of some $\tilde{u} \in \mathcal{H}$, where $P \colon \mathcal{H} \to \mathcal{U}_M$ denotes projection (truncation),
and the adjoint operator $P^*$ is the inclusion map in the other direction.
Suppose for simplicity that the operator $A$ corresponding to the bilinear
form $a$, as constructed in the proof of the Lax-Milgram theorem, is a
self-adjoint operator. If we were to try to minimize the $A$-weighted norm of the
residual, i.e.

\[ \text{find } \tilde{u} \in \mathcal{H} \text{ to minimize } \|P\tilde{u} - u\|_A, \]

then Theorem 4.28 says that $\tilde{u}$ satisfies the normal equations

\[ P^* A P \tilde{u} = P^* A u, \]
\[ \text{i.e.} \quad P^* A u^{(M)} = P^* f, \]

and the weak interpretation of this equation in $\mathcal{H}'$ is that it should hold as
an equality of scalars whenever it is tested against any $v \in \mathcal{H} = \mathcal{H}''$,

\[ \text{i.e.} \quad \langle v \mid P^* A u^{(M)} \rangle = \langle v \mid P^* f \rangle \quad \text{for all } v \in \mathcal{H}, \]
\[ \text{i.e.} \quad \langle Pv \mid A u^{(M)} \rangle = \langle Pv \mid f \rangle \quad \text{for all } v \in \mathcal{H}, \]
\[ \text{i.e.} \quad \langle v^{(M)} \mid A u^{(M)} \rangle = \langle v^{(M)} \mid f \rangle \quad \text{for all } v^{(M)} \in \mathcal{U}_M. \]

Abusing notation slightly by writing these dual pairings as inner products in
$\mathcal{H}$ yields that the weak form of the normal equations is

\[ \langle A u^{(M)}, v^{(M)} \rangle = \langle f, v^{(M)} \rangle \quad \text{for all } v^{(M)} \in \mathcal{U}_M, \]

and since $\langle A u^{(M)}, v^{(M)} \rangle = a(u^{(M)}, v^{(M)})$, this is exactly the Galerkin
problem (12.12) for $u^{(M)}$. That is, the Galerkin problem (12.12) for $u^{(M)}$ is the
weak formulation of the variational problem of minimizing the norm of the
difference between the approximate solution and the true one, with the norm
being weighted by the operator corresponding to the bilinear form $a$.

From this variational characterization of the Galerkin solution, it follows
immediately that the error $u - u^{(M)}$ is $a$-orthogonal to the approximation
subspace $\mathcal{U}_M$: for any choice of $v^{(M)} \in \mathcal{U}_M \subseteq \mathcal{H}$,

\[ a(u - u^{(M)}, v^{(M)}) = a(u, v^{(M)}) - a(u^{(M)}, v^{(M)}) \]
\[ = \langle f \mid v^{(M)} \rangle - \langle f \mid v^{(M)} \rangle \]
\[ = 0. \]

However, note well that $u^{(M)}$ is generally not the optimal approximation
of $u$ from the subspace $\mathcal{U}_M$ with respect to the original norm on $\mathcal{H}$, i.e.

\[ \| u - u^{(M)} \| \neq \inf \bigl\{ \| u - v^{(M)} \| \bigm| v^{(M)} \in \mathcal{U}_M \bigr\}. \]

The optimal approximation of $u$ from $\mathcal{U}_M$ is the orthogonal projection of $u$
onto $\mathcal{U}_M$; if $\mathcal{H}$ has an orthonormal basis $\{e_n\}$ and $u = \sum_{n \in \mathbb{N}} u_n e_n$, then the
optimal approximation of $u$ in $\mathcal{U}_M = \operatorname{span}\{e_1, \dots, e_M\}$ is $\sum_{n=1}^{M} u_n e_n$, but
this is not generally the same as the Galerkin solution $u^{(M)}$. However, the
next result, Céa's lemma, shows that $u^{(M)}$ is a quasi-optimal approximation
to $u$ (note that the ratio $C/c$ is always at least 1):

Lemma 12.11 (Céa's lemma). Let $a$, $c$ and $C$ be as in the statement of
the Lax-Milgram theorem. Then the weak solution $u \in \mathcal{H}$ and the Galerkin
solution $u^{(M)} \in \mathcal{U}_M$ satisfy

\[ \| u - u^{(M)} \| \leq \frac{C}{c} \inf \bigl\{ \| u - v^{(M)} \| \bigm| v^{(M)} \in \mathcal{U}_M \bigr\}. \]

Proof. Exercise 12.11. □
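Both the $a$-orthogonality of the error and the quasi-optimality bound can be observed in a small finite-dimensional experiment. The sketch below takes $\mathcal{H} = \mathbb{R}^n$, $a(u, v) = v^\top G u$ with a random non-symmetric coercive $G$, and $\mathcal{U}_M$ spanned by the first $M$ coordinate vectors; all of these choices, and the variable names, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M_dim = 8, 3
S = rng.standard_normal((n, n))
G = S @ S.T + np.eye(n) + 2.0 * (S - S.T)       # a(u, v) = v^T G u

c = np.linalg.eigvalsh(0.5 * (G + G.T)).min()   # coercivity constant
C = np.linalg.norm(G, 2)                         # boundedness constant

f = rng.standard_normal(n)
u = np.linalg.solve(G, f)                        # weak solution in R^n

# Galerkin solution in U_M = span{e_1, ..., e_M}: test only against e_1..e_M.
u_gal = np.zeros(n)
u_gal[:M_dim] = np.linalg.solve(G[:M_dim, :M_dim], f[:M_dim])

# Optimal approximation: orthogonal projection of u onto U_M.
u_best = u.copy()
u_best[M_dim:] = 0.0

err_galerkin = np.linalg.norm(u - u_gal)
err_best = np.linalg.norm(u - u_best)
a_orthogonality = (G @ (u - u_gal))[:M_dim]      # a(u - u_gal, e_i) for i <= M
```

In experiments of this kind `err_galerkin` typically exceeds `err_best`, but never by more than the factor `C / c` guaranteed by the lemma.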




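Matrix Form. It is helpful to cast the Galerkin problem in the form of a
matrix-vector equation by expressing it in terms of a basis $\{\phi_1, \dots, \phi_M\}$ of
$\mathcal{U}_M$. Then $u = u^{(M)}$ solves the Galerkin problem if and only if

\[ a(u, \phi_m) = \langle f \mid \phi_m \rangle \quad \text{for } m \in \{1, \dots, M\}. \]

Now expand $u$ in this basis as $u = \sum_{m=1}^{M} u_m \phi_m$ and insert this into the
previous equation:

\[ a\Bigl( \sum_{m=1}^{M} u_m \phi_m, \phi_i \Bigr) = \sum_{m=1}^{M} u_m a(\phi_m, \phi_i) = \langle f \mid \phi_i \rangle \quad \text{for } i \in \{1, \dots, M\}. \]

That is, the column vector $\boldsymbol{u} := [u_1, \dots, u_M]^\top \in \mathbb{R}^M$ of coefficients of $u$ in
the basis $\{\phi_1, \dots, \phi_M\}$ solves the matrix-vector equation

\[ \boldsymbol{a} \boldsymbol{u} = \boldsymbol{b} := \begin{bmatrix} \langle f \mid \phi_1 \rangle \\ \vdots \\ \langle f \mid \phi_M \rangle \end{bmatrix}, \tag{12.13} \]

where the matrix

\[ \boldsymbol{a} := \begin{bmatrix} a(\phi_1, \phi_1) & \cdots & a(\phi_M, \phi_1) \\ \vdots & \ddots & \vdots \\ a(\phi_1, \phi_M) & \cdots & a(\phi_M, \phi_M) \end{bmatrix} \in \mathbb{R}^{M \times M} \]

is the Gram matrix of the bilinear form $a$, and is of course a symmetric matrix
whenever $a$ is a symmetric bilinear form.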

Remark 12.12. In practice the matrix-vector equation $\boldsymbol{a} \boldsymbol{u} = \boldsymbol{b}$ is never
solved by explicitly inverting the Gram matrix $\boldsymbol{a}$ to obtain the coefficients
$u_m$ via $\boldsymbol{u} = \boldsymbol{a}^{-1} \boldsymbol{b}$. Even a relatively naive solution using a Cholesky
factorization of the Gram matrix and forward and backward substitution would
be cheaper and more numerically stable than an explicit inversion. Indeed,
in many situations the Gram matrix is sparse, and so solution methods that
take advantage of that sparsity are used; furthermore, for large systems, the
methods used are often iterative rather than direct.
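To make the matrix form concrete, the following sketch assembles the Gram matrix and load vector for the one-dimensional analogue of (12.11), $-(\kappa u')' = f$ on an interval with zero boundary values, using the interior tent functions of Example 12.10(b). It assumes that $\kappa$ and $f$ are approximated by their midpoint values on each cell, and it uses dense NumPy routines throughout; the helper names are hypothetical.

```python
import numpy as np

def assemble(nodes, kappa, f):
    """Gram matrix a[i, m] = a(phi_m, phi_i) and load vector b[i] = <f | phi_i>
    for the interior tent functions, with kappa and f sampled at cell midpoints."""
    h = np.diff(nodes)
    mid = 0.5 * (nodes[:-1] + nodes[1:])
    k = kappa(mid) / h                     # per-cell stiffness kappa / h
    load = f(mid) * h                      # per-cell integral of f (midpoint rule)
    n = len(nodes) - 2                     # number of interior nodes
    a = np.zeros((n, n))
    b = np.zeros(n)
    for i in range(n):                     # interior node i + 1 touches cells i, i + 1
        a[i, i] = k[i] + k[i + 1]
        if i + 1 < n:
            a[i, i + 1] = a[i + 1, i] = -k[i + 1]
        b[i] = 0.5 * (load[i] + load[i + 1])
    return a, b

def solve_cholesky(a, b):
    """Solve a u = b via a = L L^T and two triangular solves."""
    L = np.linalg.cholesky(a)
    y = np.linalg.solve(L, b)              # forward substitution step
    return np.linalg.solve(L.T, y)         # backward substitution step

nodes = np.linspace(0.0, 1.0, 11)
a, b = assemble(nodes, kappa=lambda x: np.ones_like(x), f=lambda x: np.ones_like(x))
u = solve_cholesky(a, b)                   # nodal values at the interior nodes
```

The generic `numpy.linalg.solve` is used here only to keep the sketch dependency-free; it does not exploit the triangular structure. A production code would use dedicated triangular solvers and, as the remark notes, sparse storage, since this Gram matrix is tridiagonal.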


Stochastic Lax-Milgram Theory. The next step is to build appropriate
Lax-Milgram theory and Galerkin projection for stochastic problems, for
which a good prototype is

\[ -\nabla \cdot (\kappa(\theta, x) \nabla u(\theta, x)) = f(\theta, x) \quad \text{for } x \in \mathcal{X}, \]
\[ u(\theta, x) = 0 \quad \text{for } x \in \partial \mathcal{X}, \]

with $\theta$ being drawn from some probability space $(\Theta, \mathscr{F}, \mu)$. To that end, we
introduce a stochastic space $\mathcal{S}$, which in the following will be $L^2(\Theta, \mu; \mathbb{R})$.
We retain also a Hilbert space $\mathcal{U}$ in which the deterministic solution $u(\theta)$ is
sought for each $\theta \in \Theta$; implicitly, $\mathcal{U}$ is independent of the problem data, or
rather of $\theta$. Thus, the space in which the stochastic solution $U$ is sought is
the tensor product Hilbert space $\mathcal{H} := \mathcal{U} \otimes \mathcal{S}$, which is isomorphic to the
space $L^2(\Theta, \mu; \mathcal{U})$ of square-integrable $\mathcal{U}$-valued random variables.

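In terms of bilinear forms, the setup is that of a bilinear-form-on-$\mathcal{U}$-valued
random variable $A$ and a $\mathcal{U}'$-valued random variable $F$. Define a bilinear form
$\alpha$ on $\mathcal{H}$ by

\[ \alpha(X, Y) := \mathbb{E}_\mu[A(X, Y)] \equiv \int_\Theta A(\theta)(X(\theta), Y(\theta)) \, \mathrm{d}\mu(\theta) \]

and, similarly, a linear functional $\beta$ on $\mathcal{H}$ by

\[ \langle \beta \mid Y \rangle := \mathbb{E}_\mu[\langle F \mid Y \rangle_{\mathcal{U}}]. \]

Clearly, if $\alpha$ satisfies the boundedness and coercivity assumptions of the
Lax-Milgram theorem on $\mathcal{H}$, then, for every $F \in L^2(\Theta, \mu; \mathcal{U}')$, there is a unique
weak solution $U \in L^2(\Theta, \mu; \mathcal{U})$ satisfying

\[ \alpha(U, Y) = \langle \beta \mid Y \rangle \quad \text{for all } Y \in L^2(\Theta, \mu; \mathcal{U}). \]

A sufficient, but not necessary, condition for $\alpha$ to satisfy the hypotheses of the
Lax-Milgram theorem on $\mathcal{H}$ is for $A(\theta)$ to satisfy those hypotheses uniformly
in $\theta$ on $\mathcal{U}$: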

Theorem 12.13 ('Uniform' stochastic Lax-Milgram theorem). Let $(\Theta, \mu)$
be a probability space, and let $A$ be a random variable on $\Theta$, taking values
in the space of bilinear forms on a Hilbert space $\mathcal{U}$, and satisfying the
hypotheses of the deterministic Lax-Milgram theorem (Theorem 12.9) uniformly
with respect to $\theta \in \Theta$. Define a bilinear form $\alpha$ and a linear functional $\beta$ on
$L^2(\Theta, \mu; \mathcal{U})$ by

\[ \alpha(X, Y) := \mathbb{E}_\mu[A(X, Y)], \]
\[ \langle \beta \mid Y \rangle := \mathbb{E}_\mu[\langle F \mid Y \rangle_{\mathcal{U}}]. \]

Then, for every $F \in L^2(\Theta, \mu; \mathcal{U}')$, there is a unique $U \in L^2(\Theta, \mu; \mathcal{U})$ such
that

\[ \alpha(U, V) = \langle \beta \mid V \rangle \quad \text{for all } V \in L^2(\Theta, \mu; \mathcal{U}). \]

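Proof. Suppose that $A(\theta)$ satisfies the boundedness assumption with
constant $C(\theta)$ and the coercivity assumption with constant $c(\theta)$. By hypothesis,

\[ C' := \sup_{\theta \in \Theta} C(\theta) \quad \text{and} \quad c' := \inf_{\theta \in \Theta} c(\theta) \]

are both strictly positive and finite. Then $\alpha$ satisfies, for all $X, Y \in \mathcal{H}$,

\[ \alpha(X, Y) = \mathbb{E}_\mu[A(X, Y)] \]
\[ \leq \mathbb{E}_\mu\bigl[ C \|X\|_{\mathcal{U}} \|Y\|_{\mathcal{U}} \bigr] \]
\[ \leq C' \, \mathbb{E}_\mu\bigl[ \|X\|_{\mathcal{U}}^2 \bigr]^{1/2} \mathbb{E}_\mu\bigl[ \|Y\|_{\mathcal{U}}^2 \bigr]^{1/2} \]
\[ = C' \|X\|_{\mathcal{H}} \|Y\|_{\mathcal{H}}, \]

and

\[ \alpha(X, X) = \mathbb{E}_\mu[A(X, X)] \]
\[ \geq \mathbb{E}_\mu\bigl[ c \|X\|_{\mathcal{U}}^2 \bigr] \]
\[ \geq c' \, \mathbb{E}_\mu\bigl[ \|X\|_{\mathcal{U}}^2 \bigr] = c' \|X\|_{\mathcal{H}}^2. \]

Hence, by the deterministic Lax-Milgram theorem applied to the bilinear
form $\alpha$ on the Hilbert space $\mathcal{H}$, for every $F \in L^2(\Theta, \mu; \mathcal{U}')$, there exists a
unique $U \in L^2(\Theta, \mu; \mathcal{U})$ such that

\[ \alpha(U, V) = \langle \beta \mid V \rangle \quad \text{for all } V \in L^2(\Theta, \mu; \mathcal{U}), \]

which completes the proof. □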

Remark 12.14. Note, however, that uniform boundedness and coercivity of
$A$ are quite strong assumptions, and are not necessary for $\alpha$ to be bounded
and coercive. For example, the constants $c(\theta)$ and $C(\theta)$ may degenerate to
0 or $\infty$ as $\theta$ approaches certain points of the sample space $\Theta$. Provided that
these degeneracies are integrable and yield positive and finite expected values,
this will not ruin the boundedness and coercivity of $\alpha$. Indeed, there may be
an arbitrarily large (but $\mu$-measure zero) set of $\theta$ for which there is no weak
solution $u(\theta)$ to the deterministic problem

\[ A(\theta)(u(\theta), v) = \langle F(\theta) \mid v \rangle \quad \text{for all } v \in \mathcal{U}. \]

Stochastic Galerkin Projection. Let $\mathcal{U}_M$ be a finite-dimensional subspace
of $\mathcal{U}$, with basis $\{\phi_1, \dots, \phi_M\}$. As indicated above, take the stochastic space
$\mathcal{S}$ to be $L^2(\Theta, \mu; \mathbb{R})$, which we assume to be equipped with an orthogonal
decomposition such as a gPC decomposition. Let $\mathcal{S}_K$ be a finite-dimensional
subspace of $\mathcal{S}$, for example the span of a system of orthogonal polynomials
up to degree $K$. The Galerkin projection of the stochastic problem on $\mathcal{H}$ is
to find

\[ U \approx U^{(M,K)} = \sum_{\substack{m = 1, \dots, M \\ k = 0, \dots, K}} u_{mk} \, \phi_m \otimes \Psi_k \in \mathcal{U}_M \otimes \mathcal{S}_K \]

such that

\[ \alpha(U^{(M,K)}, V) = \langle \beta \mid V \rangle \quad \text{for all } V \in \mathcal{U}_M \otimes \mathcal{S}_K. \]

In particular, it suffices to find $U^{(M,K)}$ that satisfies this condition for each
basis element $V = \phi_n \otimes \Psi_\ell$ of $\mathcal{U}_M \otimes \mathcal{S}_K$. Recall that $\phi_n \otimes \Psi_\ell$ is the function
$(\theta, x) \mapsto \phi_n(x) \Psi_\ell(\theta)$.

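Matrix Form. Let $\boldsymbol{\alpha} \in \mathbb{R}^{M(K+1) \times M(K+1)}$ be the Gram matrix of the bilinear
form $\alpha$ with respect to the basis $\{\phi_m \otimes \Psi_k \mid m = 1, \dots, M; k = 0, \dots, K\}$ of
$\mathcal{U}_M \otimes \mathcal{S}_K$. As before, the Galerkin problem is equivalent to the matrix-vector
equation

\[ \boldsymbol{\alpha} \boldsymbol{U} = \boldsymbol{\beta}, \]

where $\boldsymbol{U} \in \mathbb{R}^{M(K+1)}$ is the column vector comprised of the coefficients $u_{mk}$
and $\boldsymbol{\beta} \in \mathbb{R}^{M(K+1)}$ has components $\langle \beta \mid \phi_m \otimes \Psi_k \rangle$. It is natural to ask: how is
the Gram matrix $\boldsymbol{\alpha}$ related to the $\mathbb{R}^{M \times M}$-valued random variable $\boldsymbol{A}$ that is
the Gram matrix of the random bilinear form $A$?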

It turns out that there are two natural ways to formulate the answers 
to this problem: one formulation is a block-symmetric matrix in which the 
stochastic modes are not properly normalized; the other features the prop- 
erly normalized stochastic modes and the multiplication tensor, but loses the 
symmetry. 

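Symmetric Formulation. Suppose that, for each fixed $\theta \in \Theta$, the deterministic
problem, discretized and written in matrix-vector form in the basis
$\{\phi_1, \dots, \phi_M\}$ of $\mathcal{U}_M$, is

\[ \boldsymbol{A}(\theta) \boldsymbol{U}(\theta) = \boldsymbol{B}(\theta). \]

Here, the Galerkin solution is $U(\theta) \in \mathcal{U}_M$ and $\boldsymbol{U}(\theta) \in \mathbb{R}^M$ is the column
vector of coefficients of $U(\theta)$ with respect to $\{\phi_1, \dots, \phi_M\}$. Write the Galerkin
solution $U \in \mathcal{U}_M \otimes \mathcal{S}_K$ as $U = \sum_{k=0}^K u_k \Psi_k$, and further write $\boldsymbol{u}_k \in \mathbb{R}^M$ for
the column vector corresponding to the stochastic mode $u_k \in \mathcal{U}_M$ in the
basis $\{\phi_1, \dots, \phi_M\}$, so that $\boldsymbol{U} = \sum_{k=0}^K \boldsymbol{u}_k \Psi_k$. Galerkin projection (more
specifically, testing the equation $\boldsymbol{A} \boldsymbol{U} = \boldsymbol{B}$ against $\Psi_k$) reveals that

\[ \sum_{j=0}^K \langle \Psi_k \boldsymbol{A} \Psi_j \rangle \boldsymbol{u}_j = \langle \boldsymbol{B} \Psi_k \rangle \quad \text{for each } k \in \{0, \dots, K\}. \]

This is equivalent to the (large!) block system

\[
\begin{bmatrix}
\langle \boldsymbol{A} \rangle_{00} & \cdots & \langle \boldsymbol{A} \rangle_{0K} \\
\vdots & \ddots & \vdots \\
\langle \boldsymbol{A} \rangle_{K0} & \cdots & \langle \boldsymbol{A} \rangle_{KK}
\end{bmatrix}
\begin{bmatrix} \boldsymbol{u}_0 \\ \vdots \\ \boldsymbol{u}_K \end{bmatrix}
=
\begin{bmatrix} \langle \boldsymbol{B} \Psi_0 \rangle \\ \vdots \\ \langle \boldsymbol{B} \Psi_K \rangle \end{bmatrix},
\tag{12.14}
\]

where, for $0 \le i, j \le K$,

\[ \langle \boldsymbol{A} \rangle_{ij} := \langle \Psi_i \boldsymbol{A} \Psi_j \rangle \in \mathbb{R}^{M \times M}. \]

Note that, in general, the stochastic modes $u_j$ of the solution $U$ (and,
indeed, the coefficients $u_{jm}$ of the stochastic modes in the deterministic basis
$\{\phi_1, \dots, \phi_M\}$) are all coupled together through the matrix on the left-hand
side of (12.14). Note that this matrix is block-symmetric, since clearly

\[ \langle \boldsymbol{A} \rangle_{ij} := \langle \Psi_i \boldsymbol{A} \Psi_j \rangle = \langle \Psi_j \boldsymbol{A} \Psi_i \rangle = \langle \boldsymbol{A} \rangle_{ji}. \]

However, the entries $\langle \boldsymbol{B} \Psi_k \rangle$ on the right-hand side of (12.14) are not the
stochastic modes $\boldsymbol{b}_k \in \mathbb{R}^M$ of $\boldsymbol{B}$, since they have not been normalized by $\langle \Psi_k^2 \rangle$.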

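Multiplication Tensor Formulation. On the other hand, we can consider
the case in which the random Gram matrix $\boldsymbol{A}$ has a (truncated) gPC
expansion

\[ \boldsymbol{A} = \sum_{k=0}^K \boldsymbol{a}_k \Psi_k \]

with coefficient matrices

\[ \boldsymbol{a}_k := \frac{\langle \boldsymbol{A} \Psi_k \rangle}{\langle \Psi_k^2 \rangle} \in \mathbb{R}^{M \times M}. \]

In this case, the blocks $\langle \boldsymbol{A} \rangle_{kj}$ in (12.14) are given by

\[ \langle \boldsymbol{A} \rangle_{kj} = \langle \Psi_k \boldsymbol{A} \Psi_j \rangle = \sum_{i=0}^K \boldsymbol{a}_i \langle \Psi_i \Psi_j \Psi_k \rangle. \]

Hence, the Galerkin block system (12.14) is equivalent to

\[
\begin{bmatrix}
\langle\!\langle \boldsymbol{A} \rangle\!\rangle_{00} & \cdots & \langle\!\langle \boldsymbol{A} \rangle\!\rangle_{0K} \\
\vdots & \ddots & \vdots \\
\langle\!\langle \boldsymbol{A} \rangle\!\rangle_{K0} & \cdots & \langle\!\langle \boldsymbol{A} \rangle\!\rangle_{KK}
\end{bmatrix}
\begin{bmatrix} \boldsymbol{u}_0 \\ \vdots \\ \boldsymbol{u}_K \end{bmatrix}
=
\begin{bmatrix} \boldsymbol{b}_0 \\ \vdots \\ \boldsymbol{b}_K \end{bmatrix},
\tag{12.15}
\]

where $\boldsymbol{b}_k := \langle \boldsymbol{B} \Psi_k \rangle / \langle \Psi_k^2 \rangle \in \mathbb{R}^M$ is the column vector of coefficients of the $k$th
stochastic mode $b_k$ of $B$ in the basis $\{\phi_1, \dots, \phi_M\}$ of $\mathcal{U}_M$, and

\[ \langle\!\langle \boldsymbol{A} \rangle\!\rangle_{kj} := \sum_{i=0}^K M_{ijk} \boldsymbol{a}_i, \]

where

\[ M_{ijk} := \frac{\langle \Psi_i \Psi_j \Psi_k \rangle}{\langle \Psi_k^2 \rangle} \]

is the multiplication tensor for the basis $\{\Psi_k \mid k \in \mathbb{N}_0\}$. Thus, the system
(12.15) is the system

\[ \sum_{i,j=0}^K M_{ijk} \boldsymbol{a}_i \boldsymbol{u}_j = \boldsymbol{b}_k, \]

i.e. the pseudo-spectral product $\boldsymbol{A} * \boldsymbol{U} = \boldsymbol{B}$.

The matrix in (12.15) is not block-symmetric, since the $k$th block row is
normalized by $\langle \Psi_k^2 \rangle$, and in general the normalizing factors for each block
row will be distinct. On the other hand, formulation (12.15) has the advantage
that the properly normalized stochastic modes of $\boldsymbol{A}$, $\boldsymbol{U}$ and $\boldsymbol{B}$ appear
throughout, and it makes clear use of the multiplication tensor $M_{ijk}$.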
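The relationship between the two formulations can be checked numerically. The following is a minimal sketch, assuming a Hermite basis for a standard Gaussian germ and a small illustrative problem in which the random Gram matrix is $\boldsymbol{A} = \boldsymbol{a}_0 + \boldsymbol{a}_1 \Xi$; the function names and the numerical data are hypothetical and chosen only for illustration.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

def hermite_triple_products(K):
    """T[i, j, k] = E[He_i He_j He_k] under N(0, 1), by Gauss-Hermite quadrature."""
    n = (3 * K) // 2 + 1                  # n nodes are exact up to degree 2n - 1 >= 3K
    x, w = He.hermegauss(n)
    w = w / np.sqrt(2.0 * np.pi)          # rescale the weights to a probability measure
    P = He.hermevander(x, K)              # P[q, k] = He_k(x_q)
    return np.einsum("qi,qj,qk,q->ijk", P, P, P, w)

def galerkin_block_systems(a_modes, b_modes):
    """Assemble (12.14) and (12.15) from gPC modes a_k (M x M) and b_k (length M)."""
    a_modes, b_modes = np.asarray(a_modes), np.asarray(b_modes)
    K1, M = b_modes.shape
    T = hermite_triple_products(K1 - 1)
    gamma = T[0].diagonal().copy()        # <Psi_k^2>, using Psi_0 = 1
    # Block (k, j) of (12.14) is sum_i a_i <Psi_i Psi_j Psi_k>.
    sym = np.einsum("ijk,imn->kmjn", T, a_modes).reshape(K1 * M, K1 * M)
    rhs_sym = (gamma[:, None] * b_modes).ravel()      # entries <B Psi_k>
    # Dividing block row k by <Psi_k^2> gives the matrix of (12.15).
    tens = sym / np.repeat(gamma, M)[:, None]
    return sym, rhs_sym, tens, b_modes.ravel()

# Illustrative data (an assumption): M = 2, K = 3, A = A0 + A1 * Xi.
A0 = np.array([[4.0, 1.0], [1.0, 3.0]])
A1 = 0.1 * np.array([[1.0, 0.0], [0.0, 2.0]])
Z = np.zeros((2, 2))
a_modes = [A0, A1, Z, Z]
b_modes = [[1.0, 2.0], [0.5, 0.0], [0.0, 0.0], [0.0, 0.0]]

sym, rhs_sym, tens, rhs_tens = galerkin_block_systems(a_modes, b_modes)
u_sym = np.linalg.solve(sym, rhs_sym)     # stochastic modes from (12.14)
u_tens = np.linalg.solve(tens, rhs_tens)  # stochastic modes from (12.15)
```

In this sketch the matrix `sym` is symmetric while `tens` is not, and both systems return the same stacked vector of stochastic modes, as the text asserts.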

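Example 12.15. As a special case, suppose that the random data have
no impact on the differential operator and affect only the right-hand side
$B = \sum_{k \in \mathbb{N}_0} b_k \Psi_k$. In this case the random bilinear form $\theta \mapsto A(\theta)(\cdot, \cdot)$ is
identically equal to one bilinear form $a(\cdot, \cdot)$, so the random Gram matrix $\boldsymbol{A}$
is a deterministic matrix $\boldsymbol{a}$, and so the blocks $\langle \boldsymbol{A} \rangle_{ij}$ in (12.14) are given by

\[ \langle \boldsymbol{A} \rangle_{ij} := \langle \Psi_i \boldsymbol{a} \Psi_j \rangle = \boldsymbol{a} \langle \Psi_i \Psi_j \rangle = \boldsymbol{a} \, \delta_{ij} \langle \Psi_i^2 \rangle. \]

Hence, the stochastic Galerkin system, in its block-symmetric matrix form
(12.14), becomes the block-diagonal system

\[
\begin{bmatrix}
\boldsymbol{a} \langle \Psi_0^2 \rangle & 0 & \cdots & 0 \\
0 & \boldsymbol{a} \langle \Psi_1^2 \rangle & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \boldsymbol{a} \langle \Psi_K^2 \rangle
\end{bmatrix}
\begin{bmatrix} \boldsymbol{u}_0 \\ \boldsymbol{u}_1 \\ \vdots \\ \boldsymbol{u}_K \end{bmatrix}
=
\begin{bmatrix} \langle \boldsymbol{B} \Psi_0 \rangle \\ \langle \boldsymbol{B} \Psi_1 \rangle \\ \vdots \\ \langle \boldsymbol{B} \Psi_K \rangle \end{bmatrix}.
\]

In the alternative formulation (12.15), we simply have

\[
\begin{bmatrix}
\boldsymbol{a} & 0 & \cdots & 0 \\
0 & \boldsymbol{a} & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \boldsymbol{a}
\end{bmatrix}
\begin{bmatrix} \boldsymbol{u}_0 \\ \boldsymbol{u}_1 \\ \vdots \\ \boldsymbol{u}_K \end{bmatrix}
=
\begin{bmatrix} \boldsymbol{b}_0 \\ \boldsymbol{b}_1 \\ \vdots \\ \boldsymbol{b}_K \end{bmatrix}.
\]

Hence, the stochastic modes $\boldsymbol{u}_k$ decouple and are given by $\boldsymbol{u}_k = \boldsymbol{a}^{-1} \boldsymbol{b}_k$. Thus,
in this case, any pre-existing solver for the deterministic problem $\boldsymbol{a} \boldsymbol{u} = \boldsymbol{b}$ can
simply be re-used 'as is' $K + 1$ times with $\boldsymbol{b} = \boldsymbol{b}_k$ for $k = 0, \dots, K$ to obtain
the Galerkin solution of the stochastic problem.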
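As a small illustration of this decoupling, the sketch below re-uses a single deterministic solver for all stochastic modes at once. The matrix and right-hand sides are made-up illustrative values, and `numpy.linalg.solve` merely stands in for whatever deterministic solver is available.

```python
import numpy as np

# Deterministic Gram matrix a and stochastic modes b_k of the right-hand side
# (illustrative values only).
a = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b_modes = np.array([[1.0, 2.0],     # b_0
                    [0.5, 0.0],     # b_1
                    [0.0, 0.25]])   # b_2

# Each column of b_modes.T is one right-hand side, so a single call solves
# a u_k = b_k for every k = 0, ..., K.
u_modes = np.linalg.solve(a, b_modes.T).T

# The same modes solve the block-diagonal system of formulation (12.15).
block = np.kron(np.eye(len(b_modes)), a)
u_block = np.linalg.solve(block, b_modes.ravel()).reshape(b_modes.shape)
```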
12.5 Exercises

Exercise 12.1. Let $\gamma = \mathcal{N}(0, 1)$ be the standard Gaussian measure on $\mathbb{R}$, and
let $\{\mathrm{He}_n\}_{n \in \mathbb{N}_0}$ be the associated orthogonal system of Hermite polynomials
with $\langle \mathrm{He}_n^2 \rangle = n!$. Show that

\[ \langle \mathrm{He}_i \mathrm{He}_j \mathrm{He}_k \rangle = \frac{i! \, j! \, k!}{(s - i)! \, (s - j)! \, (s - k)!} \]

whenever $2s = i + j + k$ is even, $i + j \ge k$, $j + k \ge i$, and $k + i \ge j$; and
zero otherwise. Hence, show that the Galerkin multiplication tensor for the
Hermite polynomials is

\[
M_{ijk} =
\begin{cases}
\dfrac{i! \, j!}{(s - i)! \, (s - j)! \, (s - k)!}, & \text{if } 2s = i + j + k \in 2\mathbb{Z}, \ i + j \ge k, \ j + k \ge i, \text{ and } k + i \ge j, \\
0, & \text{otherwise.}
\end{cases}
\]
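A numerical sanity check is no substitute for the proof requested here, but it can catch indexing mistakes. The sketch below compares the closed form with Gauss-Hermite quadrature for degrees up to 5; the function names are hypothetical.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

def triple_formula(i, j, k):
    """Closed form of Exercise 12.1 for <He_i He_j He_k>."""
    if (i + j + k) % 2 or i + j < k or j + k < i or k + i < j:
        return 0.0
    s = (i + j + k) // 2
    return factorial(i) * factorial(j) * factorial(k) / (
        factorial(s - i) * factorial(s - j) * factorial(s - k))

def triple_quadrature(K):
    """<He_i He_j He_k> for i, j, k <= K by Gauss-Hermite quadrature."""
    n = (3 * K) // 2 + 1                      # exact up to degree 2n - 1 >= 3K
    x, w = He.hermegauss(n)
    P = He.hermevander(x, K)                  # P[q, k] = He_k(x_q)
    return np.einsum("qi,qj,qk,q->ijk", P, P, P, w / np.sqrt(2.0 * np.pi))

K = 5
T_quad = triple_quadrature(K)
T_form = np.array([[[triple_formula(i, j, k) for k in range(K + 1)]
                    for j in range(K + 1)] for i in range(K + 1)])
# Multiplication tensor: divide by <He_k^2> = k! along the last index.
M = T_form / np.array([factorial(k) for k in range(K + 1)], dtype=float)
```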

Exercise 12.2. Show that the multiplication tensor $M_{ijk}$ is covariant in the
indices $i$ and $j$ and contravariant in the index $k$. That is, if $\{\Psi_k \mid k \in \mathbb{N}_0\}$ and
$\{\tilde{\Psi}_k \mid k \in \mathbb{N}_0\}$ are two orthogonal bases and $A$ is the change-of-basis matrix in
the sense that $\tilde{\Psi}_j = \sum_i A_{ij} \Psi_i$, then the corresponding multiplication tensors
$M_{ijk}$ and $\tilde{M}_{ijk}$ satisfy

\[ \tilde{M}_{ijk} = \sum_{m,n,p} A_{mi} A_{nj} (A^{-1})_{kp} M_{mnp}. \]

(Thus, the multiplication tensor is a (2, 1)-tensor and differential geometers
would denote it by $M_{ij}^{k}$.)

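Exercise 12.3. (a) Show that, for fixed $K$, the Galerkin product satisfies,
for all $U, V, W \in \mathcal{S}_K$ and $\alpha, \beta \in \mathbb{R}$,

\[
\begin{aligned}
U * V &= \Pi_{\mathcal{S}_K}(UV), \\
U * V &= V * U, \\
(\alpha U) * (\beta V) &= \alpha \beta (U * V), \\
(U + V) * W &= U * W + V * W.
\end{aligned}
\]

(b) Show that the Galerkin product on $\mathcal{S}_K$ is not associative, i.e.

\[ U * (V * W) \neq (U * V) * W \quad \text{for some } U, V, W \in \mathcal{S}_K. \]

To do so, show that

\[
\begin{aligned}
(U * V) * W &= \sum_{m=0}^K \left[ \sum_{i,j,k,\ell=0}^K M_{ij\ell} M_{\ell k m} u_i v_j w_k \right] \Psi_m, \\
U * (V * W) &= \sum_{m=0}^K \left[ \sum_{i,j,k,\ell=0}^K M_{jk\ell} M_{i \ell m} u_i v_j w_k \right] \Psi_m.
\end{aligned}
\]

Show that the two (3, 1)-tensors $\sum_{\ell=0}^K M_{ij\ell} M_{\ell k m}$ and $\sum_{\ell=0}^K M_{jk\ell} M_{i \ell m}$
need not be equal.

(c) Show that the Galerkin product on $\mathcal{S}_K$ is not power-associative, by
finding $U \in \mathcal{S}_K$ for which

\[ ((U * U) * U) * U \neq (U * U) * (U * U). \]

Hint: Counterexamples can be found using the Hermite multiplication
tensor from Exercise 12.1 in the case $K = 2$.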

Exercise 12.4. The operation of Galerkin inversion can have some pathological
properties. Let $\Xi \sim \mathcal{N}(0, 1)$, and let $\mathcal{S} := L^2(\mathbb{R}, \gamma; \mathbb{R})$ have its usual
orthogonal basis of Hermite polynomials $\{\mathrm{He}_k \mid k \in \mathbb{N}_0\}$. Following the
discussion in Remark 9.16, let $a \in \mathbb{R}$, and let

Using (12.3), determine (or, rather, attempt to determine) the Hermite-Galerkin
reciprocal $V := U^{-1}$ in $\mathcal{S}_K$ for several values of $K \in \mathbb{N}$ and $a \in \mathbb{R}$
(make sure to try $a = 0$ for some especially interesting results!). For $a = 0$,
what do you observe about the invertibility of the matrix in (12.3) when $K$ is
odd or even? When it is invertible, what is its condition number (the product
of its norm and the norm of its inverse)? How does $v_0$, which would equal
$\mathbb{E}[V]$ if $V$ were an integrable random variable, compare to $a^{-1}$ when $a \neq 0$?

Exercise 12.5. Following the model of Galerkin inversion, formulate a
Galerkin method for calculating the spectral coefficients of a degree-$K$
Galerkin approximation to $U/V$ given truncated spectral expansions
$U = \sum_{k=0}^K u_k \Psi_k$ and $V = \sum_{k=0}^K v_k \Psi_k$.

Exercise 12.6. Formulate a method for calculating a pseudo-spectral
approximation to the square root of a non-negative random variable. Apply your
method to calculate the Hermite spectral coefficients of a degree-$K$ Galerkin
approximation to $\sqrt{\exp U}$, where $U \sim \mathcal{N}(m, \sigma^2)$.

Exercise 12.7. Extend Example 12.7 to incorporate uncertainty in the initial
position and velocity of the oscillator. Assume that the initial position
$X$ and initial velocity $V$ are independent random variables with (truncated)

gPC expansions X = an d V = Y^f=o - Expand the solu- 
tion of the oscillator equation in the tensor product basis := 

and calculate ANOVA-style sensitivity indices, i.e. 


s 


2 

(ho) 


u 


|2 h Tr ( x > v ) 


W)i Xj) 



2 (ty( x >v) 

' (hj) 


Exercise 12.8. Perform the analogues of Example 12.7 and Exercise 12.7
for the Van der Pol oscillator

\[ \ddot{u}(t) - \mu (1 - u(t)^2) \dot{u}(t) + \omega^2 u(t) = 0, \]

with natural frequency $\omega > 0$ and damping $\mu > 0$. Model both $\omega$ and $\mu$ as
random variables with gPC expansions of your choice, and, for various $T > 0$,
calculate sensitivity indices for $u(T)$ with respect to the uncertainties in $\omega$,
$\mu$, and initial data.

Exercise 12.9. Let $a$ be a bilinear form satisfying the hypotheses of the
Lax-Milgram theorem. Given $f \in \mathcal{H}'$, show that the unique $u$ such that
$a(u, v) = \langle f \mid v \rangle$ for all $v \in \mathcal{H}$ satisfies $\|u\|_{\mathcal{H}} \le c^{-1} \|f\|_{\mathcal{H}'}$.

Exercise 12.10 (Lax-Milgram with two Hilbert spaces). Let $\mathcal{U}$ and $\mathcal{V}$ be
Hilbert spaces, and let $a \colon \mathcal{U} \times \mathcal{V} \to \mathbb{K}$ be a bilinear form such that there exist
constants $0 < c \le C < \infty$ such that, for all $u \in \mathcal{U}$ and $v \in \mathcal{V}$,

\[ c \|u\|_{\mathcal{U}} \|v\|_{\mathcal{V}} \le |a(u, v)| \le C \|u\|_{\mathcal{U}} \|v\|_{\mathcal{V}}. \]

By following the steps in the proof of the usual Lax-Milgram theorem, show
that, for all $f \in \mathcal{V}'$, there exists a unique $u \in \mathcal{U}$ such that, for all $v \in \mathcal{V}$,
$a(u, v) = \langle f \mid v \rangle$, and show also that this $u$ satisfies the estimate
$\|u\|_{\mathcal{U}} \le c^{-1} \|f\|_{\mathcal{V}'}$.

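Exercise 12.11 (Céa's lemma). Let $a$, $c$ and $C$ be as in the statement of the
Lax-Milgram theorem. Show that the weak solution $u \in \mathcal{H}$ and the Galerkin
solution $u^{(M)} \in \mathcal{U}_M$ satisfy

\[ \bigl\| u - u^{(M)} \bigr\| \le \frac{C}{c} \inf \Bigl\{ \bigl\| u - v^{(M)} \bigr\| \Bigm| v^{(M)} \in \mathcal{U}_M \Bigr\}. \]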

Exercise 12.12. Consider a partition of the unit interval $[0, 1]$ into $N + 1$
equally spaced nodes

\[ 0 = x_0 < x_1 = h < x_2 = 2h < \dots < x_N = 1, \]

where $h = \frac{1}{N} > 0$. For $n = 0, \dots, N$, let

\[
\phi_n(x) :=
\begin{cases}
0, & \text{if } x \le 0 \text{ or } x \le x_{n-1}; \\
(x - x_{n-1})/h, & \text{if } x_{n-1} \le x \le x_n; \\
(x_{n+1} - x)/h, & \text{if } x_n \le x \le x_{n+1}; \\
0, & \text{if } x \ge 1 \text{ or } x \ge x_{n+1}.
\end{cases}
\]

What space of functions is spanned by $\phi_0, \dots, \phi_N$? For these functions
$\phi_0, \dots, \phi_N$, calculate the Gram matrix for the bilinear form

\[ a(u, v) := \int_0^1 u'(x) v'(x) \, \mathrm{d}x \]

corresponding to the Laplace operator. Determine also the vector components
$\langle f, \phi_n \rangle$ in the Galerkin equation (12.13).


Chapter 13 

Non-Intrusive Methods 


[W]hen people thought the Earth was flat, 
they were wrong. When people thought the 
Earth was spherical, they were wrong. But if 
you think that thinking the Earth is spherical 
is just as wrong as thinking the Earth is flat, 
then your view is wronger than both of them 
put together. 


The Relativity of Wrong 
Isaac Asimov 


Chapter 12 considers a spectral approach to UQ, namely Galerkin expansion,
that is mathematically very attractive: it is a natural extension of the
Galerkin methods that are commonly used for deterministic PDEs and (up
to a constant) minimizes the stochastic residual. However, it has the severe
disadvantage that the stochastic modes of the solution are coupled together by a
large system such as (12.15). Hence, the Galerkin formalism is not suitable for
situations in which deterministic solutions are slow and expensive to obtain,
and the deterministic solution method cannot be modified. Many so-called
legacy codes are not amenable to such intrusive methods of UQ.

In contrast, this chapter considers non-intrusive spectral methods for UQ.
These are characterized by the feature that the solution $U(\theta)$ of the
deterministic problem is a 'black box' that does not need to be modified for use
in the spectral method, beyond being able to be evaluated at any desired
point $\theta$ of the probability space $(\Theta, \mathcal{F}, \mu)$. Indeed, sometimes, it is necessary
to go one step further than this and consider the case of legacy data, i.e. an
archive or data set of past input-output pairs $\{(\theta_n, U(\theta_n)) \mid n = 1, \dots, N\}$,
sampled according to a possibly unknown or sub-optimal strategy, that is
provided 'as is' and that cannot be modified or extended at all: the reasons
for such restrictions may range from financial or practical difficulties to legal
and ethical concerns.


(c) Springer International Publishing Switzerland 2015 

T.J. Sullivan, Introduction to Uncertainty Quantification , Texts 

in Applied Mathematics 63, DOI 10.1007/978-3-319-23395-6-13 
There is a substantial overlap between non-intrusive methods for UQ and
deterministic methods for interpolation and approximation as discussed in
Chapter 8. However, this chapter additionally considers the method of Gaussian
process regression (also known as kriging), which produces a probabilistic
prediction of $U(\theta)$ away from the data set, including a variance-based
measure of uncertainty in that prediction.


13.1 Non-intrusive Spectral Methods 

One class of non-intrusive UQ methods is the family of non-intrusive spectral 
methods , namely the determination of approximate spectral coefficients, e.g. 
polynomial chaos coefficients, of an uncertain quantity U . The distinguishing 
feature here, in contrast to the approximate spectral coefficients calculated in 
Chapter 12, is that realizations of U are used directly. A good mental model 
is that the realizations of U will be used as evaluations in a quadrature rule, 
to determine an approximate orthogonal projection onto a finite-dimensional 
subspace of the stochastic solution space. For this reason, these methods are 
sometimes called non-intrusive spectral projection (NISP). 

Consider a square-integrable stochastic process $U \colon \Theta \to \mathcal{U}$ taking values
in a separable Hilbert space¹ $\mathcal{U}$, with a spectral expansion

\[ U = \sum_{k \in \mathbb{N}_0} u_k \Psi_k \]

of $U \in L^2(\Theta, \mu; \mathcal{U}) \cong \mathcal{U} \otimes L^2(\Theta, \mu; \mathbb{R})$ in terms of coefficients (stochastic
modes) $u_k \in \mathcal{U}$ and an orthogonal basis $\{\Psi_k \mid k \in \mathbb{N}_0\}$ of $L^2(\Theta, \mu; \mathbb{R})$. As
usual, the stochastic modes are given by

\[ u_k = \frac{\langle U \Psi_k \rangle}{\langle \Psi_k^2 \rangle} = \frac{1}{\gamma_k} \int_\Theta U(\theta) \Psi_k(\theta) \, \mathrm{d}\mu(\theta). \tag{13.1} \]

If the normalization constants $\gamma_k := \langle \Psi_k^2 \rangle \equiv \|\Psi_k\|_{L^2(\mu)}^2$ are known ahead of
time, then it remains only to approximate the integral with respect to $\mu$ of
the product of $U$ with each basis function $\Psi_k$; in some cases, the normalization
constants must also be approximated. In any case, the aim is to use
realizations of $U$ to determine approximate stochastic modes $\tilde{u}_k \in \mathcal{U}$, with
$\tilde{u}_k \approx u_k$, and hence an approximation

\[ \tilde{U} := \sum_{k \in \mathbb{N}_0} \tilde{u}_k \Psi_k \approx U. \]

Such a stochastic process $\tilde{U}$ is sometimes called a surrogate or emulator for
the original process $U$.


¹ As usual, readers will lose little by assuming that $\mathcal{U} = \mathbb{R}$ on a first reading. Later, $\mathcal{U}$
should be thought of as a non-trivial space of time- and space-dependent fields, so that
$U(t, x; \theta) = \sum_{k \in \mathbb{N}_0} u_k(t, x) \Psi_k(\theta)$.
Deterministic Quadrature. If the dimension of $\Theta$ is low and $U(\theta)$ is relatively
smooth as a function of $\theta$, then an appealing approach to the estimation
of $\langle U \Psi_k \rangle$ is deterministic quadrature. For optimal polynomial accuracy,
Gaussian quadrature (i.e. nodes at the roots of $\mu$-orthogonal polynomials)
may be used. In practice, nested quadrature rules such as Clenshaw-Curtis
may be preferable since one does not wish to have to discard past solutions
of $U$ upon passing to a more accurate quadrature rule. For multi-dimensional
domains of integration $\Theta$, sparse quadrature rules may be used to partially
alleviate the curse of dimension.

Note that, if the basis elements $\Psi_k$ are polynomials, then the normalization
constant $\gamma_k := \langle \Psi_k^2 \rangle$ can be evaluated numerically but with zero quadrature
error by Gaussian quadrature with sufficiently many nodes: if $\Psi_k$ has degree $k$,
then $\Psi_k^2$ has degree $2k$, and since an $n$-point Gaussian rule is exact for
polynomials of degree at most $2n - 1$, at least $k + 1$ nodes are needed.
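The quadrature approach can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the germ is uniform on $[-1, 1]$ with the Legendre basis, $\mathcal{U} = \mathbb{R}$, and `model` is a hypothetical stand-in for the black-box solution map $\theta \mapsto U(\theta)$.

```python
import numpy as np
from numpy.polynomial import legendre as L

def nisp_legendre(model, K, n_nodes):
    """Approximate Legendre PC coefficients of model(xi), xi ~ Unif([-1, 1])."""
    x, w = L.leggauss(n_nodes)            # Gauss-Legendre rule for Lebesgue measure
    w = w / 2.0                           # rescale to the uniform probability measure
    V = L.legvander(x, K)                 # V[n, k] = Le_k(x_n)
    gamma = 1.0 / (2.0 * np.arange(K + 1) + 1.0)   # <Le_k^2> under Unif([-1, 1])
    y = np.array([model(xi) for xi in x]) # one black-box evaluation per node
    return (V.T @ (w * y)) / gamma        # quadrature version of (13.1)

# Toy black box: xi^2 = (1/3) Le_0 + (2/3) Le_2, recovered exactly here
# because the integrands are polynomials of degree at most 5.
u_tilde = nisp_legendre(lambda xi: xi ** 2, K=3, n_nodes=4)
```

Only evaluations of `model` at the quadrature nodes are required, which is the sense in which the method is non-intrusive.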

Monte Carlo and Quasi-Monte Carlo Integration. If the dimension
of $\Theta$ is high, or $U(\theta)$ is a non-smooth function of $\theta$, then it is tempting
to resort to Monte Carlo approximation of $\langle U \Psi_k \rangle$. This approach is also
appealing because the calculation of the stochastic modes can be written
as a straightforward (but often large) matrix-matrix multiplication. The
problem with Monte Carlo methods, as ever, is the slow convergence rate of
$\sim (\text{number of samples})^{-1/2}$; quasi-Monte Carlo quadrature may be used to
improve the convergence rate for smoother integrands.


Connection with Linear Least Squares. There is a close connection
between least-squares minimization and the determination of approximate
spectral coefficients via quadrature (be it deterministic or stochastic). Let
basis functions $\Psi_0, \dots, \Psi_K$ and nodes $\theta_1, \dots, \theta_N$ be given, and let

\[
V :=
\begin{bmatrix}
\Psi_0(\theta_1) & \cdots & \Psi_K(\theta_1) \\
\vdots & \ddots & \vdots \\
\Psi_0(\theta_N) & \cdots & \Psi_K(\theta_N)
\end{bmatrix}
\in \mathbb{R}^{N \times (K+1)}
\tag{13.2}
\]

be the associated Vandermonde-like matrix. Also, let $Q(f) := \sum_{n=1}^N w_n f(\theta_n)$
be an $N$-point quadrature rule using the nodes $\theta_1, \dots, \theta_N$, and let
$W := \operatorname{diag}(w_1, \dots, w_N) \in \mathbb{R}^{N \times N}$. For example, if the $\theta_n$ are i.i.d. draws from the
measure $\mu$ on $\Theta$, then

\[ w_1 = \dots = w_N = \frac{1}{N} \]

corresponds to the 'vanilla' Monte Carlo quadrature rule $Q$.

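Theorem 13.1. Given observed data $y_n := U(\theta_n)$ for $n = 1, \dots, N$, and
$y = [y_1, \dots, y_N]$, the following statements about approximate spectral coefficients
$\tilde{u} = (\tilde{u}_0, \dots, \tilde{u}_K)$ for $\tilde{U} := \sum_{k=0}^K \tilde{u}_k \Psi_k$ are equivalent:

(a) $\tilde{U}$ minimizes the weighted sum of squared residuals

\[ \sum_{n=1}^N w_n \bigl| \tilde{U}(\theta_n) - y_n \bigr|^2; \]

(b) $\tilde{u}$ satisfies

\[ V^T W V \tilde{u} = V^T W y^T; \tag{13.3} \]

(c) $\tilde{U} = U$ in the weak sense, tested against $\Psi_0, \dots, \Psi_K$ using the quadrature
rule $Q$, i.e., for $k = 0, \dots, K$,

\[ Q(\Psi_k \tilde{U}) = Q(\Psi_k U). \]

Proof. Since

\[ V \tilde{u} = \begin{bmatrix} \tilde{U}(\theta_1) \\ \vdots \\ \tilde{U}(\theta_N) \end{bmatrix}, \]

the weighted sum of squared residuals $\sum_{n=1}^N w_n |\tilde{U}(\theta_n) - y_n|^2$ for approximate
model $\tilde{U}$ equals $\|V \tilde{u} - y^T\|_W^2$. By Theorem 4.28, this function of $\tilde{u}$ is minimized
if and only if $\tilde{u}$ satisfies the normal equations (13.3), which shows that
(a) $\iff$ (b). Explicit calculation of the left- and right-hand sides of (13.3)
yields

\[
\sum_{n=1}^N w_n
\begin{bmatrix} \Psi_0(\theta_n) \tilde{U}(\theta_n) \\ \vdots \\ \Psi_K(\theta_n) \tilde{U}(\theta_n) \end{bmatrix}
=
\sum_{n=1}^N w_n
\begin{bmatrix} \Psi_0(\theta_n) y_n \\ \vdots \\ \Psi_K(\theta_n) y_n \end{bmatrix},
\]

which shows that (b) $\iff$ (c). □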


Note that the matrix $V^T W V$ on the left-hand side of (13.3) is

\[
V^T W V =
\begin{bmatrix}
Q(\Psi_0 \Psi_0) & \cdots & Q(\Psi_0 \Psi_K) \\
\vdots & \ddots & \vdots \\
Q(\Psi_K \Psi_0) & \cdots & Q(\Psi_K \Psi_K)
\end{bmatrix}
\in \mathbb{R}^{(K+1) \times (K+1)},
\]

i.e. is the Gram matrix of the basis functions $\Psi_0, \dots, \Psi_K$ with respect to the
quadrature rule $Q$'s associated inner product. Therefore, if the quadrature
rule $Q$ is one associated to $\mu$ (e.g. a Gaussian quadrature formula for $\mu$, or a
Monte Carlo quadrature with i.i.d. $\theta_n \sim \mu$), then $V^T W V$ will be an approximation
to the Gram matrix of the basis functions $\Psi_0, \dots, \Psi_K$ in the $L^2(\mu)$
inner product. In particular, dependent upon the accuracy of the quadrature
rule $Q$, we will have $V^T W V \approx \operatorname{diag}(\gamma_0, \dots, \gamma_K)$, and then

\[ \tilde{u}_k \approx \frac{Q(\Psi_k U)}{\gamma_k}, \]

i.e. $\tilde{u}_k$ approximately satisfies the orthogonal projection condition (13.1)
satisfied by $u_k$.

In practice, when given $\{\theta_n\}_{n=1}^N$ that are not necessarily associated with
some quadrature rule for $\mu$, along with corresponding output values
$\{y_n := U(\theta_n)\}_{n=1}^N$, it is common to construct approximate stochastic modes and
hence an approximate spectral expansion $\tilde{U}$ by choosing $\tilde{u}_0, \dots, \tilde{u}_K$ to minimize
some weighted sum of squared residuals, i.e. according to (13.3).
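The equivalences of Theorem 13.1 are easy to observe numerically. The following sketch assumes a Legendre basis, i.i.d. uniform nodes with 'vanilla' Monte Carlo weights, and uses $\exp$ as a hypothetical stand-in for the black-box map $\theta \mapsto U(\theta)$.

```python
import numpy as np
from numpy.polynomial import legendre as L

rng = np.random.default_rng(0)
N, K = 200, 4
theta = rng.uniform(-1.0, 1.0, size=N)    # i.i.d. draws from Unif([-1, 1])
w = np.full(N, 1.0 / N)                   # 'vanilla' Monte Carlo weights
V = L.legvander(theta, K)                 # V[n, k] = Le_k(theta_n), as in (13.2)
y = np.exp(theta)                         # stand-in for the data y_n = U(theta_n)

# (b) Normal equations V^T W V u = V^T W y^T.
G = V.T @ (w[:, None] * V)                # Gram matrix w.r.t. the rule Q
u_normal = np.linalg.solve(G, V.T @ (w * y))

# (a) Weighted least squares, obtained by scaling row n by sqrt(w_n).
sw = np.sqrt(w)
u_lstsq, *_ = np.linalg.lstsq(sw[:, None] * V, sw * y, rcond=None)

# (c) Weak sense: Q(Psi_k U~) = Q(Psi_k U) for k = 0, ..., K.
weak_lhs = V.T @ (w * (V @ u_normal))
weak_rhs = V.T @ (w * y)
```

For well-conditioned problems the two routes agree to rounding error; for ill-conditioned $V$ the least-squares route is usually the safer of the two, since forming $V^T W V$ squares the condition number.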

Conversely, one can engage in the design of experiments, i.e. the selection
of $\{\theta_n\}_{n=1}^N$, to optimize some derived quantity of the matrix $V$; common
choices include

• A-optimality, in which the trace of $(V^T V)^{-1}$ is minimized;

• D-optimality, in which the determinant of $V^T V$ is maximized;

• E-optimality, in which the least singular value of $V^T V$ is maximized; and

• G-optimality, in which the largest diagonal term in the orthogonal projection
$V (V^T V)^{-1} V^T \in \mathbb{R}^{N \times N}$ is minimized.
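These four criteria can be evaluated directly for any candidate nodal set. The sketch below is illustrative only: it assumes a Legendre basis on $[-1, 1]$, compares two hypothetical designs, and uses dense linear algebra that is adequate only for small problems.

```python
import numpy as np
from numpy.polynomial import legendre as L

def design_criteria(V):
    """Return the A-, D-, E- and G-optimality quantities for a design matrix V."""
    G = V.T @ V
    G_inv = np.linalg.inv(G)
    hat = V @ G_inv @ V.T                     # orthogonal projection onto range(V)
    return {
        "A": np.trace(G_inv),                 # to be minimized
        "D": np.linalg.det(G),                # to be maximized
        "E": np.linalg.eigvalsh(G).min(),     # least singular value of V^T V, to be maximized
        "G": hat.diagonal().max(),            # to be minimized
    }

N, K = 8, 3
uniform_nodes = np.linspace(-1.0, 1.0, N)
chebyshev_nodes = np.cos((2 * np.arange(1, N + 1) - 1) * np.pi / (2 * N))
crit_uniform = design_criteria(L.legvander(uniform_nodes, K))
crit_chebyshev = design_criteria(L.legvander(chebyshev_nodes, K))
```

Since the projection has trace $K + 1$, its largest diagonal entry always lies between $(K + 1)/N$ and $1$, which gives a quick check on the G-optimality value.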

Remark 13.2. The Vandermonde-like matrix $V$ from (13.2) is often
ill-conditioned, i.e. has singular values of hugely different magnitudes. Often, this
is a property of the normalization constants of the basis functions $\{\Psi_k\}_{k=0}^K$.
As can be seen from Table 8.2, many of the standard families of orthogonal
polynomials have normalization constants $\|\psi_k\|_{L^2}$ that tend to $0$ or to $\infty$
as $k \to \infty$. A tensor product system $\{\psi_\alpha\}_{\alpha \in \mathbb{N}_0^d}$ of multivariate orthogonal
polynomials, as in Theorem 8.25, might well have

\[ \liminf_{|\alpha| \to \infty} \|\psi_\alpha\|_{L^2} = 0 \quad \text{and} \quad \limsup_{|\alpha| \to \infty} \|\psi_\alpha\|_{L^2} = \infty; \]

this phenomenon arises in, for example, the products of the Legendre and
Hermite, or the Legendre and Charlier, bases. Working with orthonormal
bases, or using preconditioners, alleviates the difficulties caused by such
ill-conditioned matrices $V$.

Remark 13.3. In practice, the following sources of error arise when computing
non-intrusive approximate spectral expansions in the fashion outlined
in this section:

(a) discretization error comes about through the approximation of $\mathcal{U}$ by a
finite-dimensional subspace $\mathcal{U}_M$, i.e. the approximation of the stochastic
modes by a finite sum $u_k \approx \sum_{m=1}^M u_{km} \phi_m$, where $\{\phi_m \mid m \in \mathbb{N}\}$ is
some basis for $\mathcal{U}$;

(b) truncation error comes about through the truncation of the spectral
expansion for $U$ after finitely many terms, i.e. $U \approx \sum_{k=0}^K u_k \Psi_k$;

(c) quadrature error comes about through the approximate nature of the
numerical integration scheme used to find the stochastic modes; classical
statistical concerns about the unbiasedness of estimators for expected
values fall into this category. The choice of integration nodes contributes
greatly to this source of error.

A complete quantification of the uncertainty associated with predictions of $U$
made using a truncated non-intrusively constructed spectral stochastic model
$\tilde{U} := \sum_{k=0}^K \tilde{u}_k \Psi_k$ requires an understanding of all three of these sources of
error, and there is necessarily some tradeoff among them when trying to give
'optimal' predictions for a given level of computational and experimental cost.

Remark 13.4. It often happens in practice that the process U is not initially 
defined on the same probability space as the gPC basis functions, in which 
case some appropriate changes of variables must be used. In particular, this 
situation can arise if we are given an archive of legacy data values of U 
without the corresponding inputs. See Exercise 13.5 for a discussion of these 
issues in the example setting of Gaussian mixtures. 

Example 13.5. Consider again the simple harmonic oscillator

\[ \ddot{U}(t) = -\Omega^2 U(t) \]

with the initial conditions $U(0) = 1$, $\dot{U}(0) = 0$. Suppose that
$\Omega \sim \mathrm{Unif}([0.8, 1.2])$, so that $\Omega = 1.0 + 0.2 \Xi$, where $\Xi \sim \mathrm{Unif}([-1, 1])$ is the
stochastic germ, with its associated Legendre basis polynomials. Figure 13.1
shows the evolution of the approximate stochastic modes for $U$, calculated
using $N = 1000$ i.i.d. samples of $\Xi$ and the least squares approach of
Theorem 13.1. As in previous examples of this type, the forward solution of the
ODE is performed using a symplectic integrator with time step 0.01.

Note that many standard computational algebra routines, such as Python's
numpy.linalg.lstsq, will solve all the least squares problems of finding
$\{\tilde{u}_k(t_i)\}_{k=0}^K$ for all time points $t_i$ in a vectorized manner. That is, it is not
necessary to call numpy.linalg.lstsq with matrix $V$ and data $\{U(t_0, \omega_n)\}_{n=1}^N$
to obtain $\{\tilde{u}_k(t_0)\}_{k=0}^K$, and then do the same for $t_1$, etc. Instead, all the data
$\{U(t_i, \omega_n) \mid n = 1, \dots, N; i \in \mathbb{N}_0\}$ can be supplied at once as a matrix,
yielding a matrix $\{\tilde{u}_k(t_i) \mid k = 0, \dots, K; i \in \mathbb{N}_0\}$.
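The vectorized call can be sketched as follows. This is a minimal illustration rather than a reproduction of the computation behind Figure 13.1: in particular, the closed-form solution $\cos(\Omega t)$ is used as an assumed stand-in for the symplectic integrator, and the time grid and random seed are arbitrary choices.

```python
import numpy as np
from numpy.polynomial import legendre as L

rng = np.random.default_rng(1)
N, K = 1000, 10
xi = rng.uniform(-1.0, 1.0, size=N)       # i.i.d. samples of the germ
omega = 1.0 + 0.2 * xi                    # samples of the natural frequency
t = np.linspace(0.0, 10.0, 101)           # time points t_i

# Y[n, i] = U(t_i, omega_n); the exact solution cos(omega t) replaces the
# numerical ODE solve purely to keep the sketch short.
Y = np.cos(np.outer(omega, t))

V = L.legvander(xi, K)                    # V[n, k] = Le_k(xi_n)
# One call solves the least squares problem for every time point at once:
# U_modes[k, i] approximates the stochastic mode u_k(t_i).
U_modes, *_ = np.linalg.lstsq(V, Y, rcond=None)

mean = U_modes[0]
gamma = 1.0 / (2.0 * np.arange(1, K + 1) + 1.0)   # <Le_k^2> for k >= 1
std = np.sqrt((gamma[:, None] * U_modes[1:] ** 2).sum(axis=0))
```

Passing a two-dimensional right-hand side to `numpy.linalg.lstsq` solves one least squares problem per column, which is exactly the vectorization described above.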


13.2 Stochastic Collocation 

Collocation methods for ordinary and partial differential equations are 
another form of interpolation. The idea is to find a low-dimensional object — 
usually a polynomial — that approximates the true solution to the differential 
equation by means of exactly satisfying the differential equation at a selected 
set of points, called collocation points or collocation nodes. An important 
feature of the collocation approach is that an approximation is constructed 
not on a pre-defined stochastic subspace, but instead uses interpolation, and 
hence both the approximation and the approximation space are implicitly 
prescribed by the collocation nodes. As the number of collocation nodes inc- 
reases, the space in which the solution is sought becomes correspondingly 
larger. 
The solid curve shows the mean $\tilde{u}_0(t)$ of the solution, the dashed curves show
the higher-degree Legendre coefficients $\tilde{u}_k(t)$, and the grey envelope shows the
mean ± one standard deviation. Over the time interval shown, $|\tilde{u}_k(t)| < 10^{-2}$ for
$k > 5$.

Contour plot of the truncated NISP model $\tilde{U}(t) = \sum_{k=0}^{10} \tilde{u}_k(t) \mathrm{Le}_k$.

Fig. 13.1: The degree-10 Legendre PC NISP solution to the simple harmonic
oscillator equation of Example 13.5 with $\Omega \sim \mathrm{Unif}([0.8, 1.2])$.
Example 13.6 (Collocation for an ODE). Consider, for example, the initial
value problem

\[ \dot{u}(t) = f(t, u(t)) \quad \text{for } t \in [a, b], \qquad u(a) = u_a, \]

to be solved on an interval of time $[a, b]$. Choose $n$ points

\[ a \le t_1 < t_2 < \dots < t_n \le b, \]

called collocation nodes. Now find a polynomial $p(t) \in \mathbb{R}_{\le n}[t]$ so that the
ODE

\[ \dot{p}(t_k) = f(t_k, p(t_k)) \]

is satisfied for $k = 1, \dots, n$, as is the initial condition $p(a) = u_a$. For example,
if $n = 2$, $t_1 = a$ and $t_2 = b$, then the coefficients $c_2, c_1, c_0 \in \mathbb{R}$ of the
polynomial approximation

\[ p(t) = \sum_{k=0}^2 c_k (t - a)^k, \]

which has derivative $\dot{p}(t) = 2 c_2 (t - a) + c_1$, are required to satisfy

\[
\begin{aligned}
\dot{p}(a) &= c_1 = f(a, p(a)), \\
\dot{p}(b) &= 2 c_2 (b - a) + c_1 = f(b, p(b)), \\
p(a) &= c_0 = u_a,
\end{aligned}
\]

i.e.

\[ p(t) = \frac{f(b, p(b)) - f(a, u_a)}{2 (b - a)} (t - a)^2 + f(a, u_a) (t - a) + u_a. \]

The above equation implicitly defines the final value $p(b)$ of the collocation
solution. This method is also known as the trapezoidal rule for ODEs, since
the same solution is obtained by rewriting the differential equation as

\[ u(t) = u(a) + \int_a^t f(s, u(s)) \, \mathrm{d}s \]

and approximating the integral on the right-hand side by the trapezoidal
quadrature rule for integrals.
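A minimal sketch of this two-node collocation step follows. The implicit equation for $p(b)$ is solved here by plain fixed-point iteration, which is an assumption made for brevity and converges only when the step $b - a$ is small relative to the Lipschitz constant of $f$; the function name is hypothetical.

```python
def collocation_step(f, a, b, u_a, n_iter=50):
    """Quadratic collocation at t1 = a, t2 = b; returns (c0, c1, c2) and p(b)."""
    h = b - a
    c0 = u_a                              # p(a) = u_a
    c1 = f(a, u_a)                        # p'(a) = f(a, p(a))
    p_b = u_a + h * c1                    # explicit Euler initial guess for p(b)
    for _ in range(n_iter):               # fixed-point iteration on the implicit
        p_b = u_a + 0.5 * h * (c1 + f(b, p_b))   # trapezoidal equation for p(b)
    c2 = (f(b, p_b) - c1) / (2.0 * h)     # from p'(b) = f(b, p(b))
    return (c0, c1, c2), p_b

# Test problem u' = -u, u(0) = 1, one step of length 0.1.
coeffs, p_b = collocation_step(lambda t, u: -u, 0.0, 0.1, 1.0)
```

For this linear test problem the implicit equation can be solved by hand, giving $p(b) = u_a (1 - h/2)/(1 + h/2)$ with $h = b - a$, the familiar trapezoidal-rule update.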

It should be made clear at the outset that there is nothing stochastic about
'stochastic collocation', just as there is nothing chaotic about 'polynomial
chaos'. The meaning of the term 'stochastic' in this case is that the collocation
principle is being applied across the 'stochastic space' (i.e. the probability
space) of a stochastic process, rather than the space/time/space-time
domain. That is, for a stochastic process $U$ with known values $U(\theta_n)$ at known
collocation points $\theta_1, \dots, \theta_N \in \Theta$, we seek an approximation $\tilde{U}$ such that

\[ \tilde{U}(\theta_n) = U(\theta_n) \quad \text{for } n = 1, \dots, N. \]

There is, however, some flexibility in how to approximate $U(\theta)$ for
$\theta \neq \theta_1, \dots, \theta_N$.

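Example 13.7. Consider, for example, the random PDE

\[
\begin{aligned}
\mathcal{L}_\theta [U(x, \theta)] &= 0 && \text{for } x \in \mathcal{X}, \ \theta \in \Theta, \\
\mathcal{B}_\theta [U(x, \theta)] &= 0 && \text{for } x \in \partial \mathcal{X}, \ \theta \in \Theta,
\end{aligned}
\]

where, for $\mu$-a.e. $\theta$ in some probability space $(\Theta, \mathcal{F}, \mu)$, the differential
operator $\mathcal{L}_\theta$ and boundary operator $\mathcal{B}_\theta$ are well defined and the PDE admits
a unique solution $U(\cdot, \theta) \colon \mathcal{X} \to \mathbb{R}$. The solution $U \colon \mathcal{X} \times \Theta \to \mathbb{R}$ is then a
stochastic process. We now let $\Theta_M := \{\theta_1, \dots, \theta_M\} \subseteq \Theta$ be a finite set of
prescribed collocation nodes. The collocation problem is to find a collocation
solution $\tilde{U}$, an approximation to the exact solution $U$, that satisfies

\[
\begin{aligned}
\mathcal{L}_{\theta_m} [\tilde{U}(x, \theta_m)] &= 0 && \text{for } x \in \mathcal{X}, \\
\mathcal{B}_{\theta_m} [\tilde{U}(x, \theta_m)] &= 0 && \text{for } x \in \partial \mathcal{X},
\end{aligned}
\]

for $m = 1, \dots, M$.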


Interpolation Approach. An obvious first approach is to use interpolating
polynomials when they are available. This is easiest when the stochastic space
$\Theta$ is one-dimensional, in which case the Lagrange basis polynomials of a given
nodal set are an attractive choice of interpolation basis. As always, though,
care must be taken to use nodal sets that will not lead to Runge oscillations; if
there is very little a priori information about the process $U$, then constructing
a 'good' nodal set may be a matter of trial and error. In general, the choice
of collocation nodes is a significant contributor to the error and uncertainty
in the resulting predictions.

Given the values $U(\theta_1), \dots, U(\theta_N)$ of $U$ at nodes $\theta_1, \dots, \theta_N$ in a
one-dimensional space $\Theta$, the (Lagrange-form polynomial interpolation) collocation
approximation $\tilde{U}$ to $U$ is given by

\[ \tilde{U}(\theta) = \sum_{n=1}^N U(\theta_n) \ell_n(\theta) = \sum_{n=1}^N U(\theta_n) \prod_{\substack{1 \le k \le N \\ k \neq n}} \frac{\theta - \theta_k}{\theta_n - \theta_k}. \]

Example 13.8. Figure 13.2 shows the results of the interpolation-collocation
approach for the simple harmonic oscillator equation considered earlier, again
for $\omega \in [0.8, 1.2]$. Two nodal sets $\omega_1, \dots, \omega_N \in \mathbb{R}$ are considered: uniform
nodes, and Chebyshev nodes. In order to make the differences between the
two solutions more easily visible, only $N = 4$ nodes are used.

The collocation solution $\tilde{U}(\cdot, \omega_n)$ at each of the collocation nodes $\omega_n$ is
the solution of the deterministic problem

\[
\begin{aligned}
\frac{\mathrm{d}^2}{\mathrm{d}t^2} \tilde{U}(t, \omega_n) &= -\omega_n^2 \tilde{U}(t, \omega_n), \\
\tilde{U}(0, \omega_n) &= 1, \\
\frac{\mathrm{d}}{\mathrm{d}t} \tilde{U}(0, \omega_n) &= 0.
\end{aligned}
\]

Away from the collocation nodes, $\tilde{U}$ is defined by polynomial interpolation:
for each $t$, $\tilde{U}(t, \omega)$ is a polynomial in $\omega$ of degree at most $N - 1$ with
prescribed values at the collocation nodes. Writing this interpolation in terms
of Lagrange basis polynomials

\[ \ell_n(\omega; \omega_1, \dots, \omega_N) := \prod_{\substack{1 \le k \le N \\ k \neq n}} \frac{\omega - \omega_k}{\omega_n - \omega_k} \]

yields

\[ \tilde{U}(t, \omega) = \sum_{n=1}^N U(t, \omega_n) \ell_n(\omega). \]

As can be seen in Figure 13.2(c-d), both nodal sets have the undesirable
property that the approximate solution $\tilde{U}(t, \omega)$ satisfies
$|\tilde{U}(t, \omega)| > |\tilde{U}(0, \omega)| = 1$ for some $t > 0$ and $\omega \in [0.8, 1.2]$.
Therefore, for general $\omega$, $\tilde{U}(t, \omega)$ is not a solution of the original ODE.
However, as the discussion around Runge's phenomenon in Section 8.5 would
lead us to expect, the regions in $(t, \omega)$-space where such unphysical values
are attained are smaller with the Chebyshev nodes than the uniformly
distributed ones.
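The construction in this example can be sketched in a few lines. Several choices below are assumptions made for illustration: the exact solution $\cos(\omega t)$ stands in for the deterministic ODE solver, the Chebyshev nodes are taken to be the roots of the degree-$N$ Chebyshev polynomial mapped to $[0.8, 1.2]$ (the original may use a different variant), and the time and frequency grids are arbitrary.

```python
import numpy as np

def lagrange_basis(nodes, w):
    """Return Lb with Lb[n, q] = ell_n(w_q) for the given interpolation nodes."""
    nodes = np.asarray(nodes, dtype=float)
    w = np.atleast_1d(np.asarray(w, dtype=float))
    Lb = np.ones((nodes.size, w.size))
    for n, wn in enumerate(nodes):
        for k, wk in enumerate(nodes):
            if k != n:
                Lb[n] *= (w - wk) / (wn - wk)
    return Lb

def collocate(nodes, t, w):
    """U~(t_i, w_q) = sum_n U(t_i, omega_n) ell_n(w_q), with U(t, omega) = cos(omega t)."""
    U_nodes = np.cos(np.outer(t, nodes))      # deterministic solves at the nodes
    return U_nodes @ lagrange_basis(nodes, w)

N = 4
uniform = np.linspace(0.8, 1.2, N)
chebyshev = 1.0 + 0.2 * np.cos((2 * np.arange(1, N + 1) - 1) * np.pi / (2 * N))
t = np.linspace(0.0, 50.0, 501)
w = np.linspace(0.8, 1.2, 201)

# Largest absolute value attained by each interpolant on the (t, omega) grid;
# values above 1 are the unphysical overshoots discussed in the text.
overshoot = {name: np.abs(collocate(nodes, t, w)).max()
             for name, nodes in [("uniform", uniform), ("chebyshev", chebyshev)]}
```

Inspecting `overshoot`, or the fraction of grid points at which the interpolant exceeds 1 in absolute value, gives a rough numerical counterpart to the comparison made in Figure 13.2.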


The extension of one-dimensional interpolation methods to the
multi-dimensional case can be handled in a theoretically straightforward manner
using tensor product grids, similar to the constructions used in quadrature.
In tensor product constructions, both the grid of interpolation points and
the interpolation polynomials are products of the associated one-dimensional
objects. Thus, in a product space $\Theta = \Theta_1 \times \dots \times \Theta_d$, we take nodes

\[ \theta_1^1, \dots, \theta_{N_1}^1 \in \Theta_1, \quad \dots, \quad \theta_1^d, \dots, \theta_{N_d}^d \in \Theta_d \]

and construct a product grid of nodes $\theta_n := (\theta_{n_1}^1, \dots, \theta_{n_d}^d) \in \Theta$, where the
multi-index $n = (n_1, \dots, n_d)$ runs over $\{1, \dots, N_1\} \times \dots \times \{1, \dots, N_d\}$. The
corresponding interpolation formula, in terms of Lagrange basis polynomials,
is then

\[ \tilde{U}(\theta) = \sum_{n = (1, \dots, 1)}^{(N_1, \dots, N_d)} U(\theta_n) \prod_{i=1}^d \ell_{n_i}(\theta^i; \theta_1^i, \dots, \theta_{N_i}^i). \]

(a) $\tilde{U}(t, \omega)$ with uniform (Newton-Cotes) nodes for $\omega \in [0.8, 1.2]$.
(b) $\tilde{U}(t, \omega)$ with Chebyshev nodes for $\omega \in [0.8, 1.2]$.
(c) Interpolation of (a) with respect to $\omega$.
(d) Interpolation of (b) with respect to $\omega$.

Fig. 13.2: Interpolation solutions for a simple harmonic oscillator with
uncertain natural frequency $\omega$, $U(0, \omega) = 1$, $\dot{U}(0, \omega) = 0$. Both cases use four
interpolation nodes. Note that the Chebyshev nodes produce smaller regions
in $(t, \omega)$-space with unphysical values $|\tilde{U}(t, \omega)| > 1$.

The problem with tensor product grids for interpolative collocation is the 
same as for tensor product quadrature: the curse of dimension, i.e. the large 
number of nodes needed to adequately resolve features of functions on high- 
dimensional spaces. The curse of dimension can be partially circumvented by 
using interpolation through sparse grids, e.g. those of Smolyak type. 

Collocation for arbitrary unstructured sets of nodes — such as those that 
arise when inheriting an archive of ‘legacy’ data that cannot be modified 
or extended for whatever reason — is a notably tricky subject, essentially 
because it boils down to polynomial interpolation through an unstructured set 
of nodes. Even the existence of interpolating polynomials such as analogues 
of the Lagrange basis polynomials is not, in general, guaranteed. 

Other Approximation Strategies. There are many other strategies for 
the construction of collocation solutions, especially in high dimension, besides 
polynomial bases. Common choices include splines and radial basis functions; 
see the bibliographic notes at the end of the chapter for references. Another 
popular method is Gaussian process regression, which is the topic of the next 
section. 


13.3 Gaussian Process Regression 

The interpolation approaches of the previous section were all deterministic in
two senses: they assume that the values $U(\theta_n)$ are observed exactly, without
error and with perfect reproducibility; they also assume that the correct form
for an interpolated value $U(\theta)$ away from the nodal set is a deterministic
function of the nodes and observed values. In many situations in the natural
sciences and commerce, these assumptions are not appropriate. Instead, it
is appropriate to incorporate an estimate of the observational uncertainties,
and to produce probabilistic predictions; this is another area in which the
Bayesian perspective is quite natural.

This section surveys one such method of stochastic interpolation, known
as Gaussian process regression or kriging; as ever, the quite rigid properties
of Gaussian measures hugely simplify the presentation. The essential idea is
that we will model $U$ as a Gaussian random field; the prior information on $U$
consists of a mean field and a covariance operator, the latter often being given
in practice by a correlation length; the observations of $U$ at discrete points
are then used to condition the prior Gaussian using Schur complementation,
and thereby produce a posterior Gaussian prediction for the value of $U$ at
any other point.

Noise-Free Observations. Suppose for simplicity that we observe the
values $y_n := U(\theta_n)$ exactly, without any observational error. We wish to use
the data $\{(\theta_n, y_n) \mid n = 1, \dots, N\}$ to make a prediction for the values of
$U$ at other points in the domain $\Theta$. To save space, we will refer to
$\theta^o = (\theta_1, \dots, \theta_N)$ as the observed points and
$y^o = (y_1, \dots, y_N)$ as the observed values; together,
$(\theta^o, y^o)$ constitute the observed data or training set. By way of contrast, we
wish to predict the value(s) $y^p$ of $U$ at point(s) $\theta^p$, referred to as the
prediction points or test points. We will abuse notation and write $m(\theta^o)$ for
$(m(\theta_1), \dots, m(\theta_N))$, and so on.

Under the prior assumption that $U$ is a Gaussian random field with known
mean $m \colon \Theta \to \mathbb{R}$ and known covariance function
$C \colon \Theta \times \Theta \to \mathbb{R}$, the random
vector $(y^o, y^p)$ is a draw from a multivariate Gaussian distribution with mean
$(m(\theta^o), m(\theta^p))$ and covariance matrix

\[
\begin{pmatrix}
C(\theta^o, \theta^o) & C(\theta^o, \theta^p)^{\mathsf{T}} \\
C(\theta^o, \theta^p) & C(\theta^p, \theta^p)
\end{pmatrix}.
\]

(Note that in the case of $N$ observed data points and one new value to be
predicted, $C(\theta^o, \theta^o)$ is an $N \times N$ block, $C(\theta^p, \theta^p)$ is
$1 \times 1$, and $C(\theta^o, \theta^p)$ is a $1 \times N$ ‘row vector’.) By
Theorem 2.54, the conditional distribution of $U(\theta^p)$
given the observations $U(\theta^o) = y^o$ is Gaussian, with its mean and variance
given in terms of the Schur complement

\[
S := C(\theta^p, \theta^p) - C(\theta^p, \theta^o)^{\mathsf{T}} C(\theta^o, \theta^o)^{-1} C(\theta^o, \theta^p)
\]

by

\[
U(\theta^p) \mid \theta^o, y^o \sim \mathcal{N}\bigl( m(\theta^p) + C(\theta^p, \theta^o) C(\theta^o, \theta^o)^{-1} (y^o - m(\theta^o)), \, S \bigr).
\]

This means that, in practice, a draw $\widehat{U}(\theta^p)$ from this conditioned Gaussian
measure would be used as a proxy/prediction for the value $U(\theta^p)$. Note that
$S$ depends only upon the locations of the interpolation nodes $\theta^o$ and $\theta^p$. Thus,
if variance is to be used as a measure of the precision of the estimate $\widehat{U}(\theta^p)$,
then it will be independent of the observed data $y^o$.

Noisy Observations. The above derivation is very easily adapted to the case
of noisy observations, i.e. $y^o = U(\theta^o) + \eta$, where $\eta$ is some random noise vector.
As usual, the Gaussian case is the simplest, and if $\eta \sim \mathcal{N}(0, \Gamma)$, then the net
effect is to replace each occurrence of “$C(\theta^o, \theta^o)$” above by
“$\Gamma + C(\theta^o, \theta^o)$”.
In terms of regularization, this is nothing other than quadratic regularization
using the norm $\| \cdot \|_{\Gamma^{1/2}} = \| \Gamma^{-1/2} \cdot \|$ on $\mathbb{R}^N$.
One advantage of regularization, as ever, is that it sacrifices the interpolation
property (exactly fitting the data) for better-conditioned solutions and
even the ability to assimilate ‘contradictory’ observed data, i.e. $\theta_n = \theta_m$ but
$y_n \neq y_m$. See Figure 13.3 for simple examples.
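To see concretely how the noise term assimilates contradictory data, consider the smallest possible case. The setting is an assumption made purely for illustration and is not taken from the text: a zero-mean prior with unit variance, one location observed twice with values $y_1 \neq y_2$, noise covariance $\Gamma = \eta^2 I$, and prediction at that same location.

```latex
\[
\Gamma + C(\theta^o, \theta^o)
= \begin{pmatrix} 1 + \eta^2 & 1 \\ 1 & 1 + \eta^2 \end{pmatrix},
\qquad
\bigl( \Gamma + C(\theta^o, \theta^o) \bigr)^{-1}
= \frac{1}{\eta^2 (2 + \eta^2)}
\begin{pmatrix} 1 + \eta^2 & -1 \\ -1 & 1 + \eta^2 \end{pmatrix}.
\]
% The cross-covariance with the prediction point is the row vector (1, 1), so
\[
\text{posterior mean} = \frac{y_1 + y_2}{2 + \eta^2},
\qquad
S = 1 - \frac{2}{2 + \eta^2} = \frac{\eta^2}{2 + \eta^2}.
\]
```

For $\eta = 0$ the matrix is singular, whereas any $\eta > 0$ gives a well-defined posterior. With $y_1 = 0.8$, $y_2 = 0.9$ and $\eta = 0.1$, the posterior mean is $1.7 / 2.01 \approx 0.846$, close to the average of the two contradictory values.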

Example 13.9. Consider $\Theta = [0, 1]$, and suppose that the prior description
of $U$ is as a zero-mean Gaussian process with Gaussian covariance kernel

\[
C(\theta, \theta') := \exp\left( - \frac{|\theta - \theta'|^2}{2 \ell^2} \right).
\]

Here $\ell > 0$ is the correlation length of the process, and the numerical results
illustrated in Figure 13.3 use $\ell = \tfrac{1}{4}$.

(a) Suppose that values $y^o = 0.1$, $0.8$ and $0.5$ are observed for $U$ at
$\theta^o = 0.1$, $0.5$, $0.9$ respectively. In this case, the matrix
$C(\theta^o, \theta^o)$ and its inverse are approximately

\[
C(\theta^o, \theta^o) =
\begin{pmatrix}
1.000 & 0.278 & 0.006 \\
0.278 & 1.000 & 0.278 \\
0.006 & 0.278 & 1.000
\end{pmatrix},
\qquad
C(\theta^o, \theta^o)^{-1} =
\begin{pmatrix}
1.090 & -0.327 & 0.084 \\
-0.327 & 1.182 & -0.327 \\
0.084 & -0.327 & 1.090
\end{pmatrix}.
\]

Figure 13.3(a) shows the posterior mean field and posterior variance: note
that the posterior mean interpolates the given data.

(b) Now suppose that values $y^o = 0.1$, $0.8$, $0.9$, and $0.5$ are observed for $U$
at $\theta^o = 0.1$, $0.5$, $0.5$, $0.9$ respectively. In this case, because there are two
contradictory values for $U$ at $\theta = 0.5$, we do not expect the posterior
mean to be a function that interpolates the data. Indeed, the matrix
$C(\theta^o, \theta^o)$ has a repeated row and column:

\[
C(\theta^o, \theta^o) =
\begin{pmatrix}
1.000 & 0.278 & 0.278 & 0.006 \\
0.278 & 1.000 & 1.000 & 0.278 \\
0.278 & 1.000 & 1.000 & 0.278 \\
0.006 & 0.278 & 0.278 & 1.000
\end{pmatrix},
\]

and hence $C(\theta^o, \theta^o)$ is not invertible. However, assuming that
$y^o = U(\theta^o) + \mathcal{N}(0, \eta^2)$, with $\eta > 0$, restores well-posedness to the problem.
Figure 13.3(b) shows the posterior mean and covariance field with the
regularization $\eta = 0.1$.

Fig. 13.3: A simple example of Gaussian process regression/kriging in one
dimension. The dots show the observed data points, the black curve the
posterior mean of the Gaussian process $U$, and the shaded region the posterior
mean $\pm$ one posterior standard deviation.
Panel titles: (a) Perfectly observed data: the posterior mean interpolates the
observed data. (b) Data additively perturbed by i.i.d. draws from
$\mathcal{N}(0, 0.01)$: the posterior mean is not interpolative.
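Both cases of Example 13.9 can be checked numerically. The following is a minimal sketch in plain Python rather than a reference implementation: the helper names (`kernel`, `solve`, `posterior`) are hypothetical, the kernel is taken to be $\exp(-|\theta - \theta'|^2 / (2\ell^2))$ with $\ell = 1/4$ (an assumption that reproduces the tabulated entries 0.278 and 0.006), and the linear solver is a textbook Gaussian elimination that a numerical library would normally replace.

```python
import math

ELL = 0.25  # correlation length assumed for Example 13.9

def kernel(s, t, ell=ELL):
    """Gaussian covariance kernel exp(-|s - t|^2 / (2 ell^2))."""
    return math.exp(-((s - t) ** 2) / (2.0 * ell ** 2))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def posterior(theta_obs, y_obs, theta_new, noise_var=0.0):
    """Posterior mean and variance at theta_new for a zero-mean prior."""
    gram = [[kernel(s, t) + (noise_var if i == j else 0.0)
             for j, t in enumerate(theta_obs)]
            for i, s in enumerate(theta_obs)]
    cross = [kernel(theta_new, t) for t in theta_obs]
    alpha = solve(gram, y_obs)   # (Gamma + C)^{-1} y
    beta = solve(gram, cross)    # (Gamma + C)^{-1} applied to the cross-covariance
    mean = sum(c * a for c, a in zip(cross, alpha))
    var = kernel(theta_new, theta_new) - sum(c * b for c, b in zip(cross, beta))
    return mean, var

# Case (a): exact observations; the posterior mean interpolates the data.
mean_a, var_a = posterior([0.1, 0.5, 0.9], [0.1, 0.8, 0.5], 0.5)

# Case (b): two contradictory values at theta = 0.5. The unregularized
# system is singular, but noise variance eta^2 = 0.01 makes it solvable.
theta_b, y_b = [0.1, 0.5, 0.5, 0.9], [0.1, 0.8, 0.9, 0.5]
try:
    posterior(theta_b, y_b, 0.5)
    singular = False
except ValueError:
    singular = True
mean_b, var_b = posterior(theta_b, y_b, 0.5, noise_var=0.1 ** 2)
```

In case (a) the posterior variance at an observed point vanishes up to rounding. In case (b) the regularized posterior mean at $\theta = 0.5$ falls between the two contradictory observations, and the posterior variance there remains strictly positive.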


Variations. There are many ‘flavours’ of the kriging method, essentially
determined by the choice of the prior, and in particular the choice of the
prior mean. For example, simple kriging assumes a known spatially constant
mean field, i.e. $\mathbb{E}[U(\theta)] = m$ for all $\theta$.

A mild generalization is ordinary kriging, in which it is again assumed
that $\mathbb{E}[U(\theta)] = m$ for all $\theta$, but $m$ is not assumed to be known. This
underdetermined situation can be rendered tractable by including additional
assumptions on the form of $\widehat{U}(\theta^p)$ as a function of the data
$(\theta^o, y^o)$; one simple assumption of this type is a linear model of the form
$\widehat{U}(\theta^p) = \sum_{n=1}^{N} w_n y_n$
for some weights $w = (w_1, \dots, w_N) \in \mathbb{R}^N$. Note well that this is not the
same as linearly interpolating the observed data.

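In this situation, as in the Gauss-Markov theorem (Theorem 6.2), the
natural criteria of zero mean error (unbiasedness) and minimal squared error
are used to determine the estimate of $U(\theta^p)$: writing
$\widehat{U}(\theta^p) = \sum_{n=1}^{N} w_n y_n$, the
unbiasedness requirement that $\mathbb{E}[\widehat{U}(\theta^p) - U(\theta^p)] = 0$ implies that the weights
$w_n$ sum to 1, and minimizing $\mathbb{E}[(\widehat{U}(\theta^p) - U(\theta^p))^2]$ becomes the constrained
optimization problem

\[
\begin{aligned}
\text{minimize:} \quad & C(\theta^p, \theta^p) - 2 w^{\mathsf{T}} C(\theta^p, \theta^o) + w^{\mathsf{T}} C(\theta^o, \theta^o) w \\
\text{among:} \quad & w \in \mathbb{R}^N \\
\text{subject to:} \quad & \sum_{n=1}^{N} w_n = 1.
\end{aligned}
\]

By the method of Lagrange multipliers, the weight vector $w$ and the Lagrange
multiplier $\lambda \in \mathbb{R}$ are given jointly as the solutions of

\[
\begin{pmatrix}
C(\theta^o, \theta^o) & \mathbf{1} \\
\mathbf{1}^{\mathsf{T}} & 0
\end{pmatrix}
\begin{pmatrix} w \\ \lambda \end{pmatrix}
=
\begin{pmatrix} C(\theta^p, \theta^o) \\ 1 \end{pmatrix},
\tag{13.4}
\]

where $\mathbf{1}$ denotes the column vector of $N$ ones.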


Even when $C(\theta^o, \theta^o)$ is positive-definite, the matrix on the left-hand side is
not positive-definite; it is a symmetric indefinite matrix. It is, however,
invertible whenever $C(\theta^o, \theta^o)$ is positive-definite, and so it is
possible² to solve for $(w, \lambda)$.
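A short block elimination makes the solution explicit. The sketch below assumes that (13.4) has the saddle-point form $K w + \lambda \mathbf{1} = k$, $\mathbf{1}^{\mathsf{T}} w = 1$, where the abbreviations $K := C(\theta^o, \theta^o)$, $k := C(\theta^p, \theta^o)$ (as a column vector) and $\mathbf{1}$ (the vector of ones) are introduced here for illustration only, and that $K$ is positive-definite.

```latex
\begin{align*}
K w + \lambda \mathbf{1} = k
  &\implies w = K^{-1} (k - \lambda \mathbf{1}), \\
\mathbf{1}^{\mathsf{T}} w = 1
  &\implies \mathbf{1}^{\mathsf{T}} K^{-1} k - \lambda \, \mathbf{1}^{\mathsf{T}} K^{-1} \mathbf{1} = 1
  \implies \lambda = \frac{\mathbf{1}^{\mathsf{T}} K^{-1} k - 1}{\mathbf{1}^{\mathsf{T}} K^{-1} \mathbf{1}}.
\end{align*}
```

The denominator $\mathbf{1}^{\mathsf{T}} K^{-1} \mathbf{1}$ is strictly positive when $K$ is positive-definite, so under these assumptions $\lambda$, and hence $w$, is uniquely determined.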


13.4 Bibliography 

Non-intrusive methods for UQ, including non-intrusive spectral projection 
and stochastic collocation, are covered by Le Maître and Knio (2010,
Chapter 3) and Xiu (2010, Chapter 7). A classic paper on interpolation using sparse
grids is that of Barthelmann et al. (2000), and applications to UQ for PDEs 
with random input data have been explored by, e.g., Nobile et al. (2008a, b). 
Narayan and Xiu (2012) give a method for stochastic collocation on arbitrary 
sets of nodes using the framework of least orthogonal interpolation, following 
an earlier Gaussian construction of de Boor and Ron (1990). Yan et al. (2012)
consider stochastic collocation algorithms with sparsity-promoting $\ell^1$
regularizations. Buhmann (2003) provides a general introduction to the theory
and practical usage of radial basis functions. A comprehensive introduction 
to splines is the book of de Boor (2001); for a more statistical interpretation, 
see, e.g., Smith (1979). 

Kriging was introduced by Krige (1951) and popularized in geostatistics by 
Matheron (1963). See, e.g., Conti et al. (2009) for applications to the interpo- 
lation of results from slow or expensive computational methods. Rasmussen 
and Williams (2006) cover the theory and application of Gaussian processes 
to machine learning; their text also gives a good overview of the relationships 
between Gaussian processes and other modelling perspectives, including reg- 
ularization, reproducing kernel Hilbert spaces, and support vector machines. 


13.5 Exercises 

Exercise 13.1. Choose distinct nodes $\theta_1, \dots, \theta_N \in \Theta = [0, 1]$ and
corresponding values $y_1, \dots, y_N \in \mathbb{R}$. Interpolate these data points in all the
ways discussed so far in the text. In particular, interpolate the data using
piecewise linear interpolation, using a polynomial of degree $N - 1$, and
using Gaussian processes with various choices of covariance kernel. Plot the
interpolants on the same axes to get an idea of their qualitative features.

² Indeed, many standard numerical linear algebra packages will readily solve the system
(13.4) without throwing any error whatsoever.

Exercise 13.2. Extend the analysis of the simple harmonic oscillator from 
Examples 13.5 and 13.8 to incorporate uncertainty in the initial condition, 
and calculate sensitivity indices with respect to the various uncertainties. 
Perform the same analyses with an alternative uncertainty model, e.g. the 
log-normal model of Example 12.6. 

Exercise 13.3. Perform the analogue of Exercise 13.2 for the Van der Pol 
oscillator 

\[
\ddot{u}(t) - \mu \bigl( 1 - u(t)^2 \bigr) \dot{u}(t) + \omega^2 u(t) = 0.
\]

Compare your results with those of the active subspace method (Example 
10.20 and Figure 10.1). 

Exercise 13.4. Extend the analysis of Exercises 13.2 and 13.3 by treating 
the time step h > 0 of the numerical ODE solver as an additional source of 
uncertainty and error. Suppose that the numerical integration scheme for the 
ODE has a global truncation error at most $C h^r$ for some $C, r > 0$, and so
model the exact solution to the ODE as the computed solution plus a draw
from $\mathrm{Unif}(-C h^r, C h^r)$. Using this randomly perturbed observational data,
calculate approximate spectral coefficients for the process using the NISP 
scheme. (For more sophisticated randomized numerical schemes for ODEs 
and PDEs, see, e.g., Schober et al. (2014) and the works listed as part of the 
Probabilistic Numerics project http://www.probabilistic-numerics.org.) 

Exercise 13.5. It often happens that the process U is not initially defined 
on the same probability space as the gPC basis functions: in particular, this 
situation can arise if we are given an archive of legacy data values of U 
without corresponding inputs. In this situation, it is necessary to transform 
both sets of random variables to a common probability space. This exercise 
concerns an example implementation of this procedure in the case that U is 
a real-valued Gaussian mixture: for some weights $w_1, \dots, w_J \geq 0$ summing
to 1, means $m_1, \dots, m_J \in \mathbb{R}$, and variances
$\sigma_1^2, \dots, \sigma_J^2 > 0$, the Lebesgue
probability density $f_U \colon \mathbb{R} \to [0, \infty)$ of $U$ is given as the following convex
combination of Gaussian densities:

\[
f_U(x) := \sum_{j=1}^{J} \frac{w_j}{\sqrt{2 \pi \sigma_j^2}}
\exp\left( - \frac{(x - m_j)^2}{2 \sigma_j^2} \right).
\tag{13.5}
\]

Suppose that we wish to perform a Hermite expansion of $U$, i.e. to write
$U = \sum_{k \in \mathbb{N}_0} u_k \mathrm{He}_k(Z)$, where
$Z \sim \gamma = \mathcal{N}(0, 1)$. The immediate problem is that
$U$ is defined as a function of $\theta$ in some abstract probability space
$(\Theta, \mathscr{F}, \mu)$,
not as a function of $z$ in the concrete probability space
$(\mathbb{R}, \mathscr{B}(\mathbb{R}), \gamma)$.

(a) Let $\Theta = \{1, \dots, J\} \times \mathbb{R}$, and define a probability measure $\mu$ on $\Theta$ by

\[
\mu := \sum_{j=1}^{J} w_j \delta_j \otimes \mathcal{N}(m_j, \sigma_j^2).
\]

(In terms of sampling, this means that draws $(j, y)$ from $\mu$ are performed
by first choosing $j \in \{1, \dots, J\}$ at random according to the weighting
$w_1, \dots, w_J$, and then drawing a Gaussian sample $y \sim \mathcal{N}(m_j, \sigma_j^2)$.)
Let $P \colon \Theta \to \mathbb{R}$ denote projection onto the second component, i.e.
$P(j, y) := y$. Show that the push-forward measure $P_* \mu$ on $\mathbb{R}$ is the
Gaussian mixture (13.5).

(b) Let $F_U \colon \mathbb{R} \to [0, 1]$ denote the cumulative distribution function (CDF)
of $U$, i.e.

\[
F_U(x) := \mathbb{P}_\mu [U \leq x] = \int_{-\infty}^{x} f_U(s) \, \mathrm{d}s.
\]

Show that $F_U$ is invertible, and that if $V \sim \mathrm{Unif}([0, 1])$, then $F_U^{-1}(V)$ has
the same distribution as $U$.

(c) Let $\Phi$ denote the CDF of the standard normal distribution $\gamma$. Show, by
change of integration variables, that

\[
\langle U, \mathrm{He}_k \rangle_{L^2(\gamma)}
= \int_0^1 F_U^{-1}(v) \, \mathrm{He}_k\bigl( \Phi^{-1}(v) \bigr) \, \mathrm{d}v.
\tag{13.6}
\]

(d) Use your favourite quadrature rule for uniform measure on $[0, 1]$ to
approximately evaluate (13.6), and hence calculate approximate Hermite
PC coefficients $u_k$ for $U$.

(e) Choose some $m_j$ and $\sigma_j^2$, and generate $N$ i.i.d. sample realizations
$y_1, \dots, y_N$ of $U$ using the observation of part (a). Approximate $F_U$ by
the empirical CDF of the data, i.e.

\[
F_U(x) \approx \widehat{F}_y(x) := \frac{\bigl| \{ 1 \leq n \leq N \mid y_n \leq x \} \bigr|}{N}.
\]

Use this approximation and your favourite quadrature rule for uniform
measure on $[0, 1]$ to approximately evaluate (13.6), and hence calculate
approximate Hermite PC coefficients $u_k$ for $U$. (This procedure, using the
empirical CDF, is essentially the one that we must use if we are given
only the data $y$ and no functional relationship of the form $y_n = U(\theta_n)$.)

(f) Compare the results of parts (d) and (e). 

Exercise 13.6. Choose nodes in the square $[0, 1]^2$ and corresponding data
values, and interpolate them using Gaussian process regression with a radial
covariance function such as $C(x, x') = \exp(-\| x - x' \|^2 / r^2)$, with $r > 0$ being
a correlation length parameter. Produce accompanying plots of the posterior
variance field.


Chapter 14 

Distributional Uncertainty 


Technology, in common with many other 
activities, tends toward avoidance of risks by 
investors. Uncertainty is ruled out if possible. 
[P]eople generally prefer the predictable. Few
recognize how destructive this can be, how it 
imposes severe limits on variability and thus 
makes whole populations fatally vulnerable to 
the shocking ways our universe can throw the 
dice. 


Heretics of Dune 
Frank Herbert 


In the previous chapters, it has been assumed that an exact model is 
available for the probabilistic components of a system, i.e. that all probability 
distributions involved are known and can be sampled. In practice, however, 
such assumptions about probability distributions are always wrong to some 
degree: the distributions used in practice may only be simple approximations 
of more complicated real ones, or there may be significant uncertainty about 
what the real distributions actually are. The same is true of uncertainty about 
the correct form of the forward physical model. In the Bayesian paradigm, 
similar issues arise if the available information is insufficient for us to specify 
(or ‘elicit’) a unique prior and likelihood model. Therefore, the topic of this 
chapter is how to deal with such uncertainty about probability distributions 
and response functions. 


(c) Springer International Publishing Switzerland 2015
T.J. Sullivan, Introduction to Uncertainty Quantification, Texts
in Applied Mathematics 63, DOI 10.1007/978-3-319-23395-6_14


14.1 Maximum Entropy Distributions


Suppose that we are interested in the value $Q(\mu^\dagger)$ of some quantity of interest
that is a functional of a partially known probability measure $\mu^\dagger$ on a space
$\mathcal{X}$. Very often, $Q(\mu^\dagger)$ arises as the expected value with respect to $\mu^\dagger$ of some
function $q \colon \mathcal{X} \to \mathbb{R}$, so the objective is to determine

\[
Q(\mu^\dagger) = \mathbb{E}_{X \sim \mu^\dagger}[q(X)].
\]

Now suppose that $\mu^\dagger$ is known only to lie in some subset
$\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$. How
should we try to understand or approximate $Q(\mu^\dagger)$? One approach is the
following MaxEnt Principle:

Definition 14.1. The Principle of Maximum Entropy states that if all one
knows about a probability measure $\mu$ is that it lies in some set
$\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$,
then one should take $\mu$ to be the element $\mu^{\mathrm{ME}} \in \mathcal{A}$ of maximum entropy.

There are many heuristics underlying the MaxEnt Principle, including 
appeals to equilibrium thermodynamics and attractive derivations due to 
Wallis and Jaynes (2003). If entropy is understood as being a measure of 
uninformativeness, then the MaxEnt Principle can be seen as an attempt to 
avoid bias by selecting the ‘least biased’ or ‘most uninformative’ distribution. 

Example 14.2 (Unconstrained maximum entropy distributions). If
$\mathcal{X} = \{1, \dots, m\}$ and $p \in \mathbb{R}^m_{>0}$ is a probability measure on
$\mathcal{X}$, then the entropy of $p$ is

\[
H(p) := - \sum_{i=1}^{m} p_i \log p_i.
\tag{14.1}
\]

The only constraints on $p$ are the natural ones that $p_i \geq 0$ and that
$S(p) := \sum_{i=1}^{m} p_i = 1$. Temporarily neglect the inequality constraints and use the
method of Lagrange multipliers to find the extrema of $H(p)$ among all $p \in \mathbb{R}^m$
with $S(p) = 1$; such $p$ must satisfy, for some $\lambda \in \mathbb{R}$,

\[
0 = \nabla H(p) - \lambda \nabla S(p)
= - \begin{pmatrix} 1 + \log p_1 + \lambda \\ \vdots \\ 1 + \log p_m + \lambda \end{pmatrix}.
\]

It is clear that any solution to this equation must have $p_1 = \dots = p_m$, for if
$p_i$ and $p_j$ differ, then at most one of $1 + \log p_i + \lambda$ and $1 + \log p_j + \lambda$ can equal
0 for the same value of $\lambda$. Therefore, since $S(p) = 1$, it follows that the unique
extremizer of $H(p)$ among $\{ p \in \mathbb{R}^m \mid S(p) = 1 \}$ is
$p_1 = \dots = p_m = \tfrac{1}{m}$. The
inequality constraints that were neglected initially are satisfied, and are not
active constraints, so it follows that the uniform probability measure on $\mathcal{X}$ is
the unique maximum entropy distribution on $\mathcal{X}$.

A similar argument using the calculus of variations shows that the unique
maximum entropy probability distribution on an interval $[a, b] \subset \mathbb{R}$ is the
uniform distribution $\tfrac{1}{b - a} \, \mathrm{d}x$.
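The conclusion of Example 14.2 can be sanity-checked numerically. The sketch below is an illustration under assumptions, not a proof: the helper `random_probability_vector` and the choices $m = 5$, 1000 trials and seed 0 are hypothetical. It compares the entropy of the uniform distribution, which equals $\log m$, with the entropies of randomly generated probability vectors.

```python
import math
import random

def entropy(p):
    """Discrete entropy H(p) = -sum_i p_i log p_i, with 0 log 0 read as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def random_probability_vector(m, rng):
    """Hypothetical helper: normalize m positive random numbers."""
    raw = [rng.random() + 1e-12 for _ in range(m)]
    total = sum(raw)
    return [r / total for r in raw]

m = 5
uniform = [1.0 / m] * m
rng = random.Random(0)  # fixed seed so that the run is reproducible
trial_entropies = [entropy(random_probability_vector(m, rng)) for _ in range(1000)]
largest_trial = max(trial_entropies)
```

None of the random trials exceeds $\log m$, in agreement with the Lagrange multiplier argument.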

Example 14.3 (Constrained maximum entropy distributions). Consider the
set of all probability measures $\mu$ on $\mathbb{R}$ that have mean $m$ and variance $s^2$;
what is the maximum entropy distribution in this set? Consider probability
measures $\mu$ that are absolutely continuous with respect to Lebesgue measure,
having density $\rho$. Then the aim is to find $\rho$ to maximize

\[
H(\rho) = - \int_{\mathbb{R}} \rho(x) \log \rho(x) \, \mathrm{d}x,
\]

subject to the constraints that $\rho \geq 0$, $\int_{\mathbb{R}} \rho(x) \, \mathrm{d}x = 1$,
$\int_{\mathbb{R}} x \rho(x) \, \mathrm{d}x = m$ and
$\int_{\mathbb{R}} (x - m)^2 \rho(x) \, \mathrm{d}x = s^2$. Introduce Lagrange multipliers
$c = (c_0, c_1, c_2)$ and the Lagrangian

\[
F_c(\rho) := H(\rho) + c_0 \int_{\mathbb{R}} \rho(x) \, \mathrm{d}x
+ c_1 \int_{\mathbb{R}} x \rho(x) \, \mathrm{d}x
+ c_2 \int_{\mathbb{R}} (x - m)^2 \rho(x) \, \mathrm{d}x.
\]

Consider a perturbation $\rho + t \sigma$; if $\rho$ is indeed a critical point of $F_c$ then,
regardless of $\sigma$, it must be true that

\[
\left. \frac{\mathrm{d}}{\mathrm{d}t} F_c(\rho + t \sigma) \right|_{t=0} = 0.
\]

This derivative is given by

\[
\left. \frac{\mathrm{d}}{\mathrm{d}t} F_c(\rho + t \sigma) \right|_{t=0}
= \int_{\mathbb{R}} \sigma(x) \bigl[ - \log \rho(x) - 1 + c_0 + c_1 x + c_2 (x - m)^2 \bigr] \, \mathrm{d}x.
\]

Since it is required that
$\left. \frac{\mathrm{d}}{\mathrm{d}t} F_c(\rho + t \sigma) \right|_{t=0} = 0$ for every $\sigma$,
the expression in the brackets must vanish, i.e.

\[
\rho(x) = \exp\bigl( c_0 - 1 + c_1 x + c_2 (x - m)^2 \bigr).
\]

Since $\rho(x)$ is the exponential of a quadratic form in $x$, $\rho$ must be a Gaussian of
some mean and variance, which, by hypothesis, are $m$ and $s^2$ respectively, i.e.

\[
c_0 = 1 + \log \frac{1}{\sqrt{2 \pi s^2}}, \qquad
c_1 = 0, \qquad
c_2 = - \frac{1}{2 s^2}.
\]

Thus, the maximum entropy distribution on $\mathbb{R}$ with mean $m$ and variance
$s^2$ is $\mathcal{N}(m, s^2)$, with entropy

\[
H(\mathcal{N}(m, s^2)) = \frac{1}{2} \log(2 \pi e s^2).
\]

Discrete Entropy and Convex Programming. In discrete settings, the
entropy of a probability measure $p \in \mathcal{M}_1(\{1, \dots, m\})$ with respect to the
uniform measure as defined in (14.1) is a strictly concave function of
$p \in \mathbb{R}^m_{>0}$, so that the negative entropy $\sum_{i} p_i \log p_i$ is strictly convex.
Therefore, when $p$ is constrained by a family of convex constraints, finding
the maximum entropy distribution is a convex program:

\[
\begin{aligned}
\text{minimize:} \quad & \sum_{i=1}^{m} p_i \log p_i \\
\text{with respect to:} \quad & p \in \mathbb{R}^m \\
\text{subject to:} \quad & p \geq 0 \\
& p \cdot \mathbf{1} = 1 \\
& \varphi_i(p) \leq 0 \quad \text{for } i = 1, \dots, n,
\end{aligned}
\]

for given convex functions $\varphi_1, \dots, \varphi_n \colon \mathbb{R}^m \to \mathbb{R}$. This is useful because an
explicit formula for the maximum entropy distribution, such as in Example
14.3, is rarely available. Therefore, the possibility of efficiently computing the
maximum entropy distribution, as in this convex programming situation, is
very attractive.
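As a small computational illustration, consider the classical case of a prescribed mean on a finite set of values. This is a special case chosen here for illustration: it uses one linear equality constraint in place of the general inequality constraints above, and the function names are hypothetical. In this case Lagrange multipliers give $p_i \propto \exp(-\lambda x_i)$, and since the mean of this distribution decreases monotonically in $\lambda$, the multiplier can be found by bisection. A general-purpose convex programming solver would normally be used for more complicated constraint sets.

```python
import math

def gibbs(values, lam):
    """Probability vector with p_i proportional to exp(-lam * x_i)."""
    shift = min(lam * x for x in values)          # guards against overflow
    weights = [math.exp(-(lam * x - shift)) for x in values]
    total = sum(weights)
    return [w / total for w in weights]

def mean(values, p):
    """Expected value of `values` under the probability vector p."""
    return sum(x * pi for x, pi in zip(values, p))

def maxent_with_mean(values, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Maximum entropy distribution on `values` with prescribed mean `target`.

    Bisection on lam: the mean of gibbs(values, lam) decreases as lam grows.
    Assumes min(values) < target < max(values).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(values, gibbs(values, mid)) > target:
            lo = mid      # mean too large: increase lam
        else:
            hi = mid
    return gibbs(values, 0.5 * (lo + hi))

faces = [1, 2, 3, 4, 5, 6]
p_fair = maxent_with_mean(faces, 3.5)    # recovers the uniform distribution
p_loaded = maxent_with_mean(faces, 4.5)  # tilts mass towards the high faces
```

When the target equals the unconstrained mean 3.5, the constraint is inactive in effect and the uniform distribution of Example 14.2 is recovered.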

Remark 14.4. Note well that not all classes of probability measures contain 
maximum entropy distributions: 

(a) The class of all absolutely continuous $\mu \in \mathcal{M}_1(\mathbb{R})$ with mean 0 but
arbitrary variance contains distributions of arbitrarily large entropy.

(b) The class of all absolutely continuous $\mu \in \mathcal{M}_1(\mathbb{R})$ with mean 0 and
second and third moments equal to 1 has all entropies bounded above
but there is no distribution which attains the maximal entropy.

Remark 14.5. There are some philosophical, mathematical, and practical 
objections to the use of the Principle of Maximum Entropy: 

(a) The MaxEnt Principle is an application-blind selection mechanism. It
asserts that the correct course of action when faced with a collection
$\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$ and an unknown $\mu^\dagger \in \mathcal{A}$ is to select a single representative
$\mu^{\mathrm{ME}} \in \mathcal{A}$ and to make the approximation
$Q(\mu^\dagger) \approx Q(\mu^{\mathrm{ME}})$ regardless
of what $Q$ is. This is in contrast to hierarchical and optimization-based
methods later in this chapter. Furthermore, MaxEnt distributions are
typically ‘nice’ (exponentially small tails, etc.), whereas many practical
problems with high consequences involve heavy-tailed distributions.

(b) Recalling that in fact all entropies are relative entropies (Kullback-Leibler
divergences), the result of applying the MaxEnt Principle is dependent
upon the reference measure chosen, and by Theorem 2.38 even
moderately complex systems do not admit a uniform measure for use as a
reference measure. Thus, the MaxEnt Principle would appear to depend
upon an ad hoc choice of reference measure.


14.2 Hierarchical Methods

As before, suppose that we are interested in the value $Q(\mu^\dagger)$ of some quantity
of interest that is a functional of a partially known probability measure $\mu^\dagger$ on
a space $\mathcal{X}$, and that $\mu^\dagger$ is known to lie in some subset
$\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$. Suppose
also that there is some knowledge about which $\mu \in \mathcal{A}$ are more or less likely
to be $\mu^\dagger$, and that this knowledge can be encoded in a probability measure
$\pi \in \mathcal{M}_1(\mathcal{A})$.

In such a setting, $Q(\mu^\dagger)$ may be studied via its expected value

\[
\mathbb{E}_{\mu \sim \pi}[Q(\mu)]
\]

(i.e. the average value of $Q(\mu)$ when $\mu$ is interpreted as a measure-valued random
variable distributed according to $\pi$) and measures of dispersion such as
variance. This point of view is appealing when there is good reason to believe
a particular form for a probability model but there is doubt about parameter
values, e.g. there are physical reasons to suppose that $\mu^\dagger$ is a Gaussian
measure $\mathcal{N}(m^\dagger, C^\dagger)$, and $\pi$ describes a probability distribution (perhaps a
Bayesian prior) on possible values $m$ and $C$ for $m^\dagger$ and $C^\dagger$.

Sometimes this approach is repeated, with another probability measure
on the parameters of $\pi$, and so forth. This leads to the study of hierarchical
Bayesian models.
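The hierarchical point of view can be illustrated with a deliberately small example. Everything specific in this sketch is an assumption made for illustration: the measure is taken to be $\mathcal{N}(m, 1)$ with uncertain mean $m$, the prior on $m$ is a hypothetical three-point distribution, and the quantity of interest is the probability of falling below a threshold.

```python
import math

def normal_cdf(t, mean, std):
    """P[X <= t] for X ~ N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((t - mean) / (std * math.sqrt(2.0))))

def hierarchical_summary(prior, quantity):
    """Mean and variance of quantity(mu) when mu is drawn from a discrete prior.

    `prior` is a list of (probability, parameters) pairs; both the discrete
    prior and the parameter tuples are illustrative assumptions.
    """
    expected = sum(w * quantity(*params) for w, params in prior)
    second = sum(w * quantity(*params) ** 2 for w, params in prior)
    return expected, second - expected ** 2

# Hypothetical prior on the mean m of mu = N(m, 1): three candidate values.
prior = [(0.25, (-0.5, 1.0)), (0.50, (0.0, 1.0)), (0.25, (0.5, 1.0))]
threshold = 0.0

def failure_probability(mean, std):
    """Q(mu) = P_{X ~ mu}[X <= threshold]."""
    return normal_cdf(threshold, mean, std)

expected_q, variance_q = hierarchical_summary(prior, failure_probability)
```

Because the hypothetical prior is symmetric about the threshold, the expected value of the quantity of interest is 0.5, while the positive variance records how much that quantity changes as the uncertain parameter varies.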


14.3 Distributional Robustness 

As before, suppose that we are interested in the value $Q(\mu^\dagger)$ of some quantity
of interest that is a functional of a partially-known probability measure $\mu^\dagger$
on a space $\mathcal{X}$, and that $\mu^\dagger$ is known only to lie in some subset
$\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$.
In the absence of any further information about which $\mu \in \mathcal{A}$ are more or less
likely to be $\mu^\dagger$, and particularly if the consequences of planning based on an
inaccurate estimate of $Q(\mu^\dagger)$ are very high, it makes sense to adopt a posture
of ‘healthy conservatism’ and compute bounds on $Q(\mu^\dagger)$ that are as tight as
justified by the information that $\mu^\dagger \in \mathcal{A}$, but no tighter, i.e. to find

\[
\underline{Q}(\mathcal{A}) := \inf_{\mu \in \mathcal{A}} Q(\mu)
\quad \text{and} \quad
\overline{Q}(\mathcal{A}) := \sup_{\mu \in \mathcal{A}} Q(\mu).
\]

When $Q(\mu)$ is the expected value with respect to $\mu$ of some function
$q \colon \mathcal{X} \to \mathbb{R}$, the objective is to determine

\[
\underline{Q}(\mathcal{A}) := \inf_{\mu \in \mathcal{A}} \mathbb{E}_\mu[q]
\quad \text{and} \quad
\overline{Q}(\mathcal{A}) := \sup_{\mu \in \mathcal{A}} \mathbb{E}_\mu[q].
\]

The inequality

\[
\underline{Q}(\mathcal{A}) \leq Q(\mu^\dagger) \leq \overline{Q}(\mathcal{A})
\]

is, by construction, the sharpest possible bound on $Q(\mu^\dagger)$ given only
information that $\mu^\dagger \in \mathcal{A}$: any wider inequality would be unnecessarily pessimistic,
with one of its bounds not attained; any narrower inequality would ignore
some feasible scenario $\mu \in \mathcal{A}$ that could be $\mu^\dagger$. The obvious question is, can
$\underline{Q}(\mathcal{A})$ and $\overline{Q}(\mathcal{A})$ be computed?

Naturally, the answer to this question depends upon the form of the
admissible set $\mathcal{A}$. In the case that $\mathcal{A}$ is, say, a Hellinger ball centred upon a
nominal probability distribution $\mu^*$, i.e. the available information about $\mu^\dagger$
is that

\[
d_{\mathrm{H}}(\mu^\dagger, \mu^*) < \delta,
\]

for known $\delta > 0$, then Proposition 5.12 gives an estimate for $\mathbb{E}_{\mu^\dagger}[q]$ in terms of
$\mathbb{E}_{\mu^*}[q]$. The remainder of this chapter, however, will consider admissible sets
$\mathcal{A}$ of a very different type, those specified by equality or inequality constraints
on expected values of test functions, otherwise known as generalized moment
classes.

Example 14.6. As an example of this paradigm, suppose that it is desired
to give bounds on the quality of some output $Y = f(X)$ of a manufacturing
process in which the probability distribution of the inputs $X$ is partially
known. For example, quality control procedures may prescribe upper and
lower bounds on the cumulative distribution function of $X$, but not the exact
CDF of $X$, e.g.

\[
\begin{aligned}
0 &\leq \mathbb{P}_{X \sim \mu^\dagger}[-\infty < X \leq a] \leq 0.1 \\
0.8 &\leq \mathbb{P}_{X \sim \mu^\dagger}[a < X \leq b] \leq 1.0 \\
0 &\leq \mathbb{P}_{X \sim \mu^\dagger}[b < X < \infty] \leq 0.1.
\end{aligned}
\]

Let $\mathcal{A}$ denote the (infinite-dimensional) set of all probability measures $\mu$ on $\mathbb{R}$
that are consistent with these three inequality constraints. Given the
input-to-output map $f$, what are optimal bounds on the cumulative distribution
function of $Y$, i.e., for $t \in \mathbb{R}$, what are

\[
\inf_{\mu \in \mathcal{A}} \mathbb{P}_{X \sim \mu}[f(X) \leq t]
\quad \text{and} \quad
\sup_{\mu \in \mathcal{A}} \mathbb{P}_{X \sim \mu}[f(X) \leq t]?
\tag{14.2}
\]

The results of this section will show that these extremal values can be found
by solving an optimization problem involving at most eight optimization
variables, namely four possible values $x_0, \dots, x_3 \in \mathbb{R}$ for $X$, and the four
corresponding probability masses $w_0, \dots, w_3 \geq 0$ that sum to unity. More
precisely, we minimize or maximize

\[
\sum_{i=0}^{3} w_i \mathbb{I}[f(x_i) \leq t]
\]

subject to the constraints

\[
\begin{aligned}
0 &\leq \sum_{i=0}^{3} w_i \mathbb{I}[x_i \leq a] \leq 0.1 \\
0.8 &\leq \sum_{i=0}^{3} w_i \mathbb{I}[a < x_i \leq b] \leq 1.0 \\
0 &\leq \sum_{i=0}^{3} w_i \mathbb{I}[x_i > b] \leq 0.1.
\end{aligned}
\]

In general, this problem is a non-convex global optimization problem that
can only be solved approximately. However, for fixed positions $\{x_i\}_{i=0}^{3}$, the
optimal weights $\{w_i\}_{i=0}^{3}$ can be determined quickly and accurately using the
tools of linear programming. Thus, the problem (14.2) reduces to a nonlinear
family of linear programs, parametrized by $\{x_i\}_{i=0}^{3}$.

Finite Sample Spaces. Suppose that the sample space $\mathcal{X} = \{1, \dots, K\}$ is
a finite set equipped with the discrete topology. Then the space of measurable
functions $f \colon \mathcal{X} \to \mathbb{R}$ is isomorphic to $\mathbb{R}^K$ and the space of probability
measures $\mu$ on $\mathcal{X}$ is isomorphic to the unit simplex in $\mathbb{R}^K$. If the available
information on $\mu^\dagger$ is that it lies in the set

\[
\mathcal{A} := \{ \mu \in \mathcal{M}_1(\mathcal{X}) \mid \mathbb{E}_\mu[\varphi_n] \leq c_n \text{ for } n = 1, \dots, N \}
\]

for known measurable functions $\varphi_1, \dots, \varphi_N \colon \mathcal{X} \to \mathbb{R}$ and values
$c_1, \dots, c_N \in \mathbb{R}$, then the problem of finding the extreme values of
$\mathbb{E}_\mu[q]$ among $\mu \in \mathcal{A}$
reduces to linear programming:

\[
\begin{aligned}
\text{extremize:} \quad & p \cdot q \\
\text{with respect to:} \quad & p \in \mathbb{R}^K \\
\text{subject to:} \quad & p \geq 0 \\
& p \cdot \mathbf{1} = 1 \\
& p \cdot \varphi_n \leq c_n \quad \text{for } n = 1, \dots, N.
\end{aligned}
\]

Note that the feasible set $\mathcal{A}$ for this problem is a convex subset of $\mathbb{R}^K$; indeed,
$\mathcal{A}$ is a polytope, i.e. the intersection of finitely many closed half-spaces of
$\mathbb{R}^K$. Furthermore, as a closed subset of the probability simplex in $\mathbb{R}^K$, $\mathcal{A}$ is
compact. Therefore, by Corollary 4.23, the extreme values of this problem are
certain to be found in the extremal set $\operatorname{ext}(\mathcal{A})$. This insight can be exploited
to great effect in the study of distributional robustness problems for general
sample spaces $\mathcal{X}$.
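For a single moment constraint ($N = 1$), the extreme points of the feasible polytope can be listed by hand, which gives a solver-free illustration of the statement above. This is a sketch under assumptions, not a general LP solver: the data `q`, `phi`, `c` and the function names are hypothetical, and the enumeration relies on the fact that, with one constraint, every extreme point is either a Dirac measure or a two-point measure on which the constraint is active.

```python
from itertools import combinations

def vertex_candidates(phi, c):
    """Candidate extreme points of {p >= 0, sum(p) = 1, p . phi <= c}.

    With a single moment constraint, each candidate is a Dirac measure or a
    two-point measure on which the constraint is active.
    """
    k = len(phi)
    candidates = []
    for i in range(k):
        if phi[i] <= c:
            p = [0.0] * k
            p[i] = 1.0
            candidates.append(p)
    for i, j in combinations(range(k), 2):
        lo, hi = (i, j) if phi[i] < phi[j] else (j, i)
        if phi[lo] < c < phi[hi]:
            t = (phi[hi] - c) / (phi[hi] - phi[lo])   # mass placed on `lo`
            p = [0.0] * k
            p[lo], p[hi] = t, 1.0 - t
            candidates.append(p)
    return candidates

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def extreme_values(q, phi, c):
    """Smallest and largest values of p . q over the candidate extreme points."""
    values = [dot(p, q) for p in vertex_candidates(phi, c)]
    return min(values), max(values)

# Hypothetical data on a three-point sample space.
q = [0.0, 1.0, 3.0]
phi = [0.0, 1.0, 2.0]
c = 0.5
lower, upper = extreme_values(q, phi, c)

# Brute-force check on a grid over the simplex (feasible points only).
steps = 100
grid_values = [
    dot([a / steps, b / steps, (steps - a - b) / steps], q)
    for a in range(steps + 1)
    for b in range(steps + 1 - a)
    if dot([a / steps, b / steps, (steps - a - b) / steps], phi) <= c + 1e-12
]
```

For these data the maximum is attained at the two-point measure with masses 0.75 and 0.25 on the first and third sample points, and the grid search over the whole feasible set finds nothing better.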

Remarkably, when the feasible set $\mathcal{A}$ of probability measures is sufficiently
like a polytope, it is not necessary to consider finite sample spaces.
What would appear to be an intractable optimization problem over an
infinite-dimensional set of measures is in fact equivalent to a tractable
finite-dimensional problem. Thus, the aim of this section is to find a
finite-dimensional subset $\mathcal{A}_\Delta$ of $\mathcal{A}$ with the property that

\[
\operatorname*{ext}_{\mu \in \mathcal{A}} Q(\mu) = \operatorname*{ext}_{\mu \in \mathcal{A}_\Delta} Q(\mu).
\]

To perform this reduction, it is necessary to restrict attention to probability
measures, topological spaces, and functionals that are sufficiently well-behaved.

Extreme Points of Moment Classes. The first step in this reduction is 
to classify the extremal measures in sets of probability measures that are 
prescribed by inequality or equality constraints on the expected value of 
finitely many arbitrary measurable test functions, so-called moment classes. 
Since, in finite time, we can only verify — even approximately, numerically 
— the truth of finitely many inequalities, such moment classes are appealing 
feasible sets from an epistemological point of view because they conform to 
the dictum of Karl Popper (1963) that “Our knowledge can be only finite, 
while our ignorance must necessarily be infinite.” 

Definition 14.7. A Borel measure $\mu$ on a topological space $\mathcal{X}$ is called inner
regular if, for every Borel-measurable set $E \subseteq \mathcal{X}$,

\[
\mu(E) = \sup \{ \mu(K) \mid K \subseteq E \text{ and } K \text{ is compact} \}.
\]

A pseudo-Radon space is a topological space on which every Borel probability 
measure is inner regular. A Radon space is a separable, metrizable, pseudo- 
Radon space. 

Example 14.8. (a) Lebesgue measure on Euclidean space $\mathbb{R}^n$ (restricted to
the Borel $\sigma$-algebra $\mathscr{B}(\mathbb{R}^n)$, if pedantry is the order of the day) is an
inner regular measure. Similarly, Gaussian measure is an inner regular
probability measure on $\mathbb{R}^n$.

(b) However, Lebesgue/Gaussian measures on R equipped with the topology 
of one-sided convergence are not inner regular measures: see Exercise 
14.3. 

(c) Every Polish space (i.e. every separable and completely metrizable topo- 
logical space) is a pseudo-Radon space. 

Compare the following definition of a barycentre (a centre of mass) for a 
set of probability measures with the conclusion of the Choquet-Bishop-de 
Leeuw theorem (Theorem 4.15): 

Definition 14.9. A barycentre for a set $\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$ is a probability measure
$\mu \in \mathcal{M}_1(\mathcal{X})$ such that there exists $p \in \mathcal{M}_1(\operatorname{ext}(\mathcal{A}))$ such that

\[
\mu(B) = \int_{\operatorname{ext}(\mathcal{A})} \nu(B) \, \mathrm{d}p(\nu)
\quad \text{for all measurable } B \subseteq \mathcal{X}.
\tag{14.3}
\]

The measure $p$ is said to represent the barycentre $\mu$.

Fig. 14.1: By the Choquet-Kendall theorem (Theorem 14.11), like
finite-dimensional simplices, Choquet simplices $S$ in a vector space $V$ are
characterized by the property that the intersection of any two homothetic images
of $S$, $(\alpha_1 S + v_1) \cap (\alpha_2 S + v_2)$, with $\alpha_1, \alpha_2 > 0$ and
$v_1, v_2 \in V$, is either empty,
a single point, or another homothetic image of $S$. This property holds for the
simplex (a), but not for the non-simplicial convex set (b).


Recall that a $d$-dimensional simplex is the closed convex hull of $d + 1$
points $p_0, \dots, p_d$ such that $p_1 - p_0, \dots, p_d - p_0$ are linearly independent. The
next ingredient in the analysis of distributional robustness is an appropriate 
infinite-dimensional generalization of the notion of a simplex — a Choquet 
simplex — as a subset of the vector space of signed measures on a given mea- 
surable space. One way to define Choquet simplices is through orderings and 
cones on vector spaces, but this definition can be somewhat cumbersome. In- 
stead, the following geometrical description of Choquet simplices, illustrated 
in Figure 14.1, is much more amenable to visual intuition, and more easily 
checked in practice: 

Definition 14.10. A homothety of a real topological vector space $V$ is the
composition of a positive dilation with a translation, i.e. a function $f \colon V \to V$
of the form $f(x) = \alpha x + v$, for fixed $\alpha > 0$ and $v \in V$.

Theorem 14.11 (Choquet-Kendall). A convex subset S of a topological 
vector space V is a Choquet simplex if and only if the intersection of any 
two homothetic images of S is empty, a single point, or another homothetic 
image of $S$.

With these definitions, the extreme points of moment sets of probability 
measures can be described by the following theorem: 

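Theorem 14.12 (Winkler, 1988). Let $(\mathcal{X}, \mathscr{F})$ be a measurable space and
let $S \subseteq \mathcal{M}_1(\mathscr{F})$ be a Choquet simplex such that $\operatorname{ext}(S)$ consists of Dirac
measures. Fix measurable functions $\varphi_1, \dots, \varphi_n \colon \mathcal{X} \to \mathbb{R}$ and $c_1, \dots, c_n \in \mathbb{R}$
and let

\[ \mathcal{A} := \{ \mu \in S \mid \mathbb{E}_\mu[\varphi_i] \leq c_i \text{ for } i = 1, \dots, n \}. \]

Then $\mathcal{A}$ is convex and its extremal set satisfies

\[ \operatorname{ext}(\mathcal{A}) \subseteq \mathcal{A}_\Delta := \left\{ \mu \in \mathcal{A} \,\middle|\, \begin{array}{l} \mu = \sum_{i=1}^{m} w_i \delta_{x_i}, \\ 1 \leq m \leq n + 1, \text{ and} \\ \text{the vectors } (\varphi_1(x_i), \dots, \varphi_n(x_i), 1)_{i=1}^{m} \\ \text{are linearly independent} \end{array} \right\}. \]

Furthermore, if all the moment conditions defining $\mathcal{A}$ are equalities
$\mathbb{E}_\mu[\varphi_i] = c_i$ instead of inequalities $\mathbb{E}_\mu[\varphi_i] \leq c_i$, then $\operatorname{ext}(\mathcal{A}) = \mathcal{A}_\Delta$.

Fig. 14.2: Heuristic justification of Winkler’s classification of extreme points
of moment sets (Theorem 14.12). Observe that the extreme points of the
dark grey set $\mathcal{A}$ consist of convex combinations of at most 2 point masses,
and 2 = 1 + the number of constraints defining $\mathcal{A}$.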

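The proof of Winkler’s theorem is rather technical, and is omitted. The
important point for our purposes is that, when $\mathcal{X}$ is a pseudo-Radon space,
Winkler’s theorem applies with $S = \mathcal{M}_1(\mathcal{X})$, so $\operatorname{ext}(\mathcal{A}) \subseteq \mathcal{A} \cap \Delta_n(\mathcal{X})$, where

\[ \Delta_N(\mathcal{X}) := \left\{ \sum_{i=0}^{N} w_i \delta_{x_i} \in \mathcal{M}_1(\mathcal{X}) \,\middle|\, \begin{array}{l} w_0, \dots, w_N \geq 0, \\ w_0 + \dots + w_N = 1, \\ x_0, \dots, x_N \in \mathcal{X} \end{array} \right\} \]

denotes the set of all convex combinations of at most $N + 1$ unit Dirac
measures on the space $\mathcal{X}$. Pictures like Figure 14.2 should make this an
intuitively plausible claim.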

Optimization of Measure Affine Functionals. Having understood the 
extreme points of moment classes, the next step is to show that the opti- 
mization of suitably nice functionals on such classes can be exactly reduced 
to optimization over the extremal measures in the class. 

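Definition 14.13. For $\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$, a function $F \colon \mathcal{A} \to \mathbb{R} \cup \{\pm\infty\}$ is said
to be measure affine if, for all $\mu \in \mathcal{A}$ and $p \in \mathcal{M}_1(\operatorname{ext}(\mathcal{A}))$ for which (14.3)
holds, $F$ is $p$-integrable with

\[ F(\mu) = \int_{\operatorname{ext}(\mathcal{A})} F(\nu) \, \mathrm{d}p(\nu). \tag{14.4} \]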

As always, the reader should check that the terminology ‘measure affine’
is a sensible choice by verifying that when $\mathcal{X} = \{1, \dots, K\}$ is a finite sample
space, the restriction of any affine function $F \colon \mathbb{R}^K \cong \mathcal{M}_\pm(\mathcal{X}) \to \mathbb{R}$ to a
subset $\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$ is a measure affine function in the sense of Definition
14.13.

An important and simple example of a measure affine functional is an
evaluation functional, i.e. the integration of a fixed measurable function
$q \colon \mathcal{X} \to \mathbb{R} \cup \{\pm\infty\}$:

Proposition 14.14. If $q$ is bounded either below or above, then $\mu \mapsto \mathbb{E}_\mu[q]$
is a measure affine map.

Proof. First consider the case that $q = \mathbb{I}_E$ is the indicator function of a
measurable set $E \subseteq \mathcal{X}$. Suppose that $\mu$ is a barycentre for $\mathcal{A}$ and that
$p \in \mathcal{M}_1(\operatorname{ext}(\mathcal{A}))$ represents $\mu$, i.e.

\[ \mu(B) = \int_{\operatorname{ext}(\mathcal{A})} \nu(B) \, \mathrm{d}p(\nu) \quad \text{for all measurable } B \subseteq \mathcal{X}. \]

For $B = E$, this is the statement that

\[ \mathbb{E}_\mu[\mathbb{I}_E] = \int_{\operatorname{ext}(\mathcal{A})} \mathbb{E}_\nu[\mathbb{I}_E] \, \mathrm{d}p(\nu), \]

which is (14.4). To complete the proof, verify the claim for $q$ a linear
combination of indicator functions, then for a sequence of such functions
increasing to a function that is bounded above (resp. decreasing to a func-
tion that is bounded below), and apply the monotone class theorem; see
Exercise 14.4. □

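Proposition 14.15. Let $\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$ be convex and let $F$ be a measure
affine function on $\mathcal{A}$. Then $F$ has the same extreme values on $\mathcal{A}$ and $\operatorname{ext}(\mathcal{A})$.

Proof. Without loss of generality, consider the maximization problem; the
proof for minimization is similar. Let $\mu \in \mathcal{A}$ be arbitrary and choose a prob-
ability measure $p \in \mathcal{M}_1(\operatorname{ext}(\mathcal{A}))$ with barycentre $\mu$. Then, it follows from the
barycentric formula (14.4) that

\[ F(\mu) \leq \sup_{\nu \in \operatorname{supp}(p)} F(\nu) \leq \sup_{\nu \in \operatorname{ext}(\mathcal{A})} F(\nu). \tag{14.5} \]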

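First suppose that $\sup_{\mu \in \mathcal{A}} F(\mu)$ is finite. Necessarily, $\sup_{\nu \in \operatorname{ext}(\mathcal{A})} F(\nu)$ is
also finite, but it remains to show that the two suprema are equal. Let $\varepsilon > 0$
be arbitrary. Let $\mu^*$ be $\frac{\varepsilon}{2}$-suboptimal for the problem of maximizing $F$ over
$\mathcal{A}$, i.e. $F(\mu^*) \geq \sup_{\mu \in \mathcal{A}} F(\mu) - \frac{\varepsilon}{2}$, and let $\nu^*$ be $\frac{\varepsilon}{2}$-suboptimal for the problem
of maximizing $F$ over $\operatorname{ext}(\mathcal{A})$. Then

\begin{align*}
F(\nu^*) &\geq \sup_{\nu \in \operatorname{ext}(\mathcal{A})} F(\nu) - \frac{\varepsilon}{2} \\
&\geq F(\mu^*) - \frac{\varepsilon}{2} && \text{by (14.5) with } \mu = \mu^* \\
&\geq \sup_{\mu \in \mathcal{A}} F(\mu) - \varepsilon.
\end{align*}

Since $\varepsilon > 0$ was arbitrary, $\sup_{\mu \in \mathcal{A}} F(\mu) = \sup_{\nu \in \operatorname{ext}(\mathcal{A})} F(\nu)$, and this proves
the claim in this case.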

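In the case that $\sup_{\mu \in \mathcal{A}} F(\mu) = +\infty$, let $C, \varepsilon > 0$. Then there exists
some $\mu^* \in \mathcal{A}$ such that $F(\mu^*) \geq C + \varepsilon$. Then, regardless of whether or
not $\sup_{\nu \in \operatorname{ext}(\mathcal{A})} F(\nu)$ is finite, (14.5) with $\mu = \mu^*$ implies that there is some
$\nu^* \in \operatorname{ext}(\mathcal{A})$ such that

\[ F(\nu^*) \geq F(\mu^*) - \varepsilon \geq C + \varepsilon - \varepsilon = C. \]

However, since $C > 0$ was arbitrary, it follows that in fact $\sup_{\nu \in \operatorname{ext}(\mathcal{A})} F(\nu) =
+\infty$, and this completes the proof. □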

In summary, we now have the following: 

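Theorem 14.16. Let $\mathcal{X}$ be a pseudo-Radon space and let $\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$ be a
moment class of the form

\[ \mathcal{A} := \{ \mu \in \mathcal{M}_1(\mathcal{X}) \mid \mathbb{E}_\mu[\varphi_j] \leq 0 \text{ for } j = 1, \dots, N \} \]

for prescribed measurable functions $\varphi_j \colon \mathcal{X} \to \mathbb{R}$. Then the extreme points of
$\mathcal{A}$ are given by

\[ \operatorname{ext}(\mathcal{A}) \subseteq \mathcal{A}_\Delta := \mathcal{A} \cap \Delta_N(\mathcal{X}) = \left\{ \mu \in \mathcal{M}_1(\mathcal{X}) \,\middle|\, \begin{array}{l} \text{for some } w_0, \dots, w_N \in [0, 1], \ x_0, \dots, x_N \in \mathcal{X}, \\ \mu = \sum_{i=0}^{N} w_i \delta_{x_i}, \ \sum_{i=0}^{N} w_i = 1, \\ \text{and } \sum_{i=0}^{N} w_i \varphi_j(x_i) \leq 0 \text{ for } j = 1, \dots, N \end{array} \right\}. \]

Hence, if $q$ is bounded either below or above, then $\underline{Q}(\mathcal{A}) = \underline{Q}(\mathcal{A}_\Delta)$ and
$\overline{Q}(\mathcal{A}) = \overline{Q}(\mathcal{A}_\Delta)$.

Proof. Winkler’s theorem (Theorem 14.12) implies that $\operatorname{ext}(\mathcal{A}) \subseteq \mathcal{A}_\Delta$. Since
$q$ is bounded on at least one side, Proposition 14.14 implies that $\mu \mapsto F(\mu) :=
\mathbb{E}_\mu[q]$ is measure affine. The claim then follows from Proposition 14.15. □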

Remark 14.17. (a) Theorem 14.16 is good news from a computational
standpoint for two reasons:

(i) Since any feasible measure in $\mathcal{A}_\Delta$ is completely described by $N + 1$
scalars and $N + 1$ points of $\mathcal{X}$, the reduced set of feasible measures is
a finite-dimensional object (or, at least, it is as finite-dimensional
as the space $\mathcal{X}$ is), and so it can in principle be explored using
the finite-dimensional numerical optimization techniques that can
be implemented on a computer.

(ii) Furthermore, since the probability measures in $\mathcal{A}_\Delta$ are finite sums
of Dirac measures, expectations against such measures can be per-
formed exactly using finite sums: there is no quadrature error.

(b) That said, when $\mu \in \mathcal{A}_\Delta$ has $\#\operatorname{supp}(\mu) \gg 1$, as may be the case
with problems exhibiting independence structure like those considered
below, it may be cheaper to integrate against a discrete measure $\mu =
\sum_{i=0}^{N} \alpha_i \delta_{x_i} \in \mathcal{A}_\Delta$ in a Monte Carlo fashion, by drawing a number
of independent samples from $\mu$ (i.e. $x_i$ with probability $\alpha_i$) that is
large but still much smaller than $\#\operatorname{supp}(\mu)$.
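
The two ways of integrating against a finitely supported measure that are contrasted in Remark 14.17 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the three-point measure and the integrand `q` are hypothetical choices, not taken from the text.

```python
import random

def exact_expectation(weights, points, q):
    """E_mu[q] for mu = sum_i weights[i] * delta_{points[i]}, as a finite sum."""
    return sum(w * q(x) for w, x in zip(weights, points))

def monte_carlo_expectation(weights, points, q, n_samples, seed=0):
    """Estimate E_mu[q] from i.i.d. draws: points[i] is drawn with probability weights[i]."""
    rng = random.Random(seed)
    draws = rng.choices(points, weights=weights, k=n_samples)
    return sum(q(x) for x in draws) / n_samples

# Hypothetical example: three point masses on the real line.
weights = [0.2, 0.5, 0.3]
points = [-1.0, 0.0, 2.0]
q = lambda x: x ** 2

exact = exact_expectation(weights, points, q)  # 0.2*1 + 0.5*0 + 0.3*4
estimate = monte_carlo_expectation(weights, points, q, n_samples=20000)
```

The exact sum has no quadrature error, whereas the Monte Carlo estimate carries the usual sampling error; the latter only becomes attractive when the support is far larger than in this toy example.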

In general, the optimization problems over $\mathcal{A}_\Delta$ in Theorem 14.16 can only
be solved approximately, using the tools of numerical global optimization.
However, some of the classical inequalities of basic probability theory can be
obtained in closed form by this approach.


Example 14.18 (Markov’s inequality). Suppose that $X$ is a non-negative
real-valued random variable with mean $\mathbb{E}[X] \leq m$, where $m > 0$. Given $t \geq m$, what is
the least upper bound on $\mathbb{P}[X \geq t]$?

To answer this question, observe that the given information says that the
distribution $\mu^\dagger$ of $X$ is some (and could be any!) element of $\mathcal{A}$, where

\[ \mathcal{A} := \{ \mu \in \mathcal{M}_1([0, \infty)) \mid \mathbb{E}_{X \sim \mu}[X] \leq m \}. \]

This $\mathcal{A}$ is a moment class with a single moment constraint. By Theorem
14.16, the least upper bound on $\mathbb{P}_{X \sim \mu}[X \geq t]$ among $\mu \in \mathcal{A}$ can be found by
restricting attention to the set $\mathcal{A}_\Delta$ of probability measures with support on
at most two points $x_0, x_1 \in [0, \infty)$, with masses $w_0, w_1$ respectively.

Assume without loss of generality that the two point masses are located
at $x_0$ and $x_1$ with $0 \leq x_0 \leq x_1 < \infty$. Now make a few observations:

(a) In order to satisfy the mean constraint that $\mathbb{E}[X] \leq m$, we must have
$x_0 \leq m$.

(b) If $x_1 > t$ and the mean constraint is satisfied, then moving the mass $w_1$ at
$x_1$ to $x_1' := t$ does not decrease the objective function value $\mathbb{P}_{X \sim \mu}[X \geq t]$
and the mean constraint is still satisfied. Therefore, it is sufficient to
consider two-point distributions with $x_1 = t$.

(c) By similar reasoning, it is sufficient to consider two-point distributions
with $x_0 = 0$.

(d) Finally, suppose that $x_0 = 0$, $x_1 = t$, but that

\[ \mathbb{E}[X] = w_0 x_0 + w_1 x_1 = w_1 t < m. \]

Then we may change the masses to

\[ w_1' := m/t > w_1, \qquad w_0' := 1 - m/t < w_0, \]

keeping the positions fixed, thereby increasing the objective function
value $\mathbb{P}_{X \sim \mu}[X \geq t]$ while still satisfying the mean constraint.

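Putting together the above observations yields that

\[ \sup_{\mu \in \mathcal{A}} \mathbb{P}_{X \sim \mu}[X \geq t] = \frac{m}{t}, \]

with the maximum being attained by the two-point distribution

\[ \left( 1 - \frac{m}{t} \right) \delta_0 + \frac{m}{t} \delta_t. \]

This result is exactly Markov’s inequality (the case p = 1 of Theorem 2.22).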
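
The reduction used in Example 14.18 can be checked numerically with a crude brute-force search over two-point measures. The sketch below is not an efficient optimizer: the function names, the grid, the cap `x_max` on the support points, and the values `m = 1`, `t = 4` are all hypothetical choices made for illustration.

```python
def tail_prob_two_point(x0, x1, w1, t):
    """P[X >= t] for mu = (1 - w1) * delta_{x0} + w1 * delta_{x1}."""
    return (1.0 - w1) * (x0 >= t) + w1 * (x1 >= t)

def sup_over_two_point_measures(m, t, x_max, n_grid=41):
    """Grid search for sup P[X >= t] over two-point measures with mean <= m."""
    xs = [x_max * i / (n_grid - 1) for i in range(n_grid)]
    ws = [i / (n_grid - 1) for i in range(n_grid)]
    best = 0.0
    for x0 in xs:
        for x1 in xs:
            for w1 in ws:
                mean = (1.0 - w1) * x0 + w1 * x1
                if mean <= m + 1e-12:  # mean constraint E[X] <= m
                    best = max(best, tail_prob_two_point(x0, x1, w1, t))
    return best

m, t = 1.0, 4.0
best = sup_over_two_point_measures(m, t, x_max=8.0)
markov_bound = m / t
```

With these choices the grid contains the extremizer (mass $1 - m/t$ at $0$ and mass $m/t$ at $t$), so the search should reproduce the value $m/t$ up to rounding.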

Independence. The kinds of constraints on measures (or, if you prefer, ran-
dom variables) that can be considered in Theorem 14.16 include values for,
or bounds on, functions of one or more of those random variables: e.g. the
mean of $X_1$, the variance of $X_2$, the covariance of $X_3$ and $X_4$, and so on.
However, one commonly encountered piece of information that is not of this
type is that $X_5$ and $X_6$ are independent random variables, i.e. that their joint
law is a product measure. The problem here is that sets of product measures
can fail to be convex (see Exercise 14.5), so the reduction to extreme points
cannot be applied directly. Fortunately, a cunning application of Fubini’s the-
orem resolves this difficulty. Note well, though, that unlike Theorem 14.16,
Theorem 14.19 does not say that $\mathcal{A}_\Delta \supseteq \operatorname{ext}(\mathcal{A})$; it only says that the opti-
mization problem has the same extreme values over $\mathcal{A}_\Delta$ and $\mathcal{A}$.

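Theorem 14.19. Let $\mathcal{A} \subseteq \mathcal{M}_1(\mathcal{X})$ be a moment class of the form

\[ \mathcal{A} := \left\{ \mu = \bigotimes_{k=1}^{K} \mu_k \in \bigotimes_{k=1}^{K} \mathcal{M}_1(\mathcal{X}_k) \,\middle|\, \begin{array}{l} \mathbb{E}_\mu[\varphi_j] \leq 0 \text{ for } j = 1, \dots, N, \\ \mathbb{E}_{\mu_1}[\varphi_{1j}] \leq 0 \text{ for } j = 1, \dots, N_1, \\ \quad \vdots \\ \mathbb{E}_{\mu_K}[\varphi_{Kj}] \leq 0 \text{ for } j = 1, \dots, N_K \end{array} \right\} \]

for prescribed measurable functions $\varphi_j \colon \mathcal{X} \to \mathbb{R}$ and $\varphi_{kj} \colon \mathcal{X}_k \to \mathbb{R}$. Let

\[ \mathcal{A}_\Delta := \{ \mu \in \mathcal{A} \mid \mu_k \in \Delta_{N + N_k}(\mathcal{X}_k) \}. \]

Then, if $q$ is bounded either above or below, $\underline{Q}(\mathcal{A}) = \underline{Q}(\mathcal{A}_\Delta)$ and
$\overline{Q}(\mathcal{A}) = \overline{Q}(\mathcal{A}_\Delta)$.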

Fig. 14.3: Optimization of a bilinear form $B$ over a non-convex set $A \subseteq \mathbb{R}^2$
that has convex cross-sections. The black curves show level sets of $B(x, y) =
xy$. Note that the maximum value of $B$ over $A$ is found at a point $(x^*, y^*)$
(marked with a diamond) such that $x^*$ and $y^*$ are both extreme points of the
corresponding sections $A^{y^*}$ and $A_{x^*}$ respectively.


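Proof. Let $\varepsilon > 0$ and let $\mu^* \in \mathcal{A}$ be $\frac{\varepsilon}{K+1}$-suboptimal for the maximization
of $\mu \mapsto \mathbb{E}_\mu[q]$ over $\mu \in \mathcal{A}$, i.e.

\[ \mathbb{E}_{\mu^*}[q] \geq \sup_{\mu \in \mathcal{A}} \mathbb{E}_\mu[q] - \frac{\varepsilon}{K + 1}. \]

By Fubini’s theorem,

\[ \mathbb{E}_{\mu_1^* \otimes \dots \otimes \mu_K^*}[q] = \mathbb{E}_{\mu_1^*}\bigl[ \mathbb{E}_{\mu_2^* \otimes \dots \otimes \mu_K^*}[q] \bigr]. \]

By the same arguments used in the proof of Theorem 14.16, $\mu_1^*$ can be rep-
laced by some probability measure $\nu_1 \in \mathcal{M}_1(\mathcal{X}_1)$ with support on at most
$N + N_1 + 1$ points, such that $\nu_1 \otimes \mu_2^* \otimes \dots \otimes \mu_K^* \in \mathcal{A}$, and

\[ \mathbb{E}_{\nu_1}\bigl[ \mathbb{E}_{\mu_2^* \otimes \dots \otimes \mu_K^*}[q] \bigr] \geq \mathbb{E}_{\mu_1^*}\bigl[ \mathbb{E}_{\mu_2^* \otimes \dots \otimes \mu_K^*}[q] \bigr] - \frac{\varepsilon}{K + 1} \geq \sup_{\mu \in \mathcal{A}} \mathbb{E}_\mu[q] - \frac{2 \varepsilon}{K + 1}. \]

Repeating this argument a further $K - 1$ times yields $\nu = \bigotimes_{k=1}^{K} \nu_k \in \mathcal{A}_\Delta$
such that

\[ \mathbb{E}_\nu[q] \geq \sup_{\mu \in \mathcal{A}} \mathbb{E}_\mu[q] - \varepsilon. \]

Since $\varepsilon > 0$ was arbitrary, it follows that

\[ \sup_{\mu \in \mathcal{A}_\Delta} \mathbb{E}_\mu[q] = \sup_{\mu \in \mathcal{A}} \mathbb{E}_\mu[q]. \]

The proof for the infimum is similar. □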

Example 14.20. A simple two-dimensional optimization problem that illus-
trates the essential features of Theorem 14.19 is that of optimizing a bilinear
form on $\mathbb{R}^2$ over a non-convex set with convex cross-sections. Suppose that
$A \subseteq \mathbb{R}^2$ is such that, for each $x, y \in \mathbb{R}$, the sections

\[ A_x := \{ y \in \mathbb{R} \mid (x, y) \in A \}, \quad \text{and} \quad A^y := \{ x \in \mathbb{R} \mid (x, y) \in A \} \]

are convex sets. Note that this does not imply that $A$ itself is convex, as
illustrated in Figure 14.3. Let $B \colon \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a bilinear functional: for
definiteness, consider $B(x, y) := xy$. Since $A$ is not convex, its extremal
set is undefined, so it does not even make sense to claim that $B$ has the
same extreme values on $A$ and $\operatorname{ext}(A)$. However, as can be seen in Figure
14.3, the extreme values of $B$ over $A$ are found at points $(x^*, y^*)$ for which
$x^* \in \operatorname{ext}(A^{y^*})$ and $y^* \in \operatorname{ext}(A_{x^*})$. Just as in the Fubini argument in the
proof of Theorem 14.19, the optimal point can be found by either maximiz-
ing $\max_{x \in A^y} B(x, y)$ with respect to $y$ or maximizing $\max_{y \in A_x} B(x, y)$ with
respect to $x$.
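
A concrete instance of the situation in Example 14.20 can be explored by brute force. The set used below, $A = \{ (x, y) \mid \sqrt{|x|} + \sqrt{|y|} \leq 1 \}$, is a hypothetical choice (it is not the set drawn in Figure 14.3): it is non-convex, but each of its sections is an interval and hence convex. The grid resolution and helper names are likewise assumptions made for this sketch.

```python
import math

def in_A(x, y):
    """Membership in the hypothetical set A = {sqrt|x| + sqrt|y| <= 1}."""
    return math.sqrt(abs(x)) + math.sqrt(abs(y)) <= 1.0 + 1e-12

def section_halfwidth(c):
    """The sections of A at level c are the intervals [-h, h], h = (1 - sqrt|c|)^2."""
    return (1.0 - math.sqrt(abs(c))) ** 2

# Grid search for the maximum of B(x, y) = x * y over A.
grid = [-1.0 + 2.0 * i / 200 for i in range(201)]
best_val, best_pt = max(
    (x * y, (x, y)) for x in grid for y in grid if in_A(x, y)
)
x_star, y_star = best_pt

# x_star should be an endpoint (extreme point) of the section at height y_star,
# and symmetrically for y_star.
x_is_extreme = abs(abs(x_star) - section_halfwidth(y_star)) < 1e-12
y_is_extreme = abs(abs(y_star) - section_halfwidth(x_star)) < 1e-12
```

For this set the maximizer found on the grid is $(1/4, 1/4)$ with $B = 1/16$, and each coordinate is an endpoint of the corresponding section, in line with the behaviour described in the example.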

Remark 14.21. (a) In the context of Theorem 14.19, a measure $\mu \in \mathcal{A}_\Delta$ is
of the form

\[ \mu = \bigotimes_{k=1}^{K} \sum_{i_k=0}^{N+N_k} w_{k i_k} \delta_{x_{k i_k}} = \sum_{i=(0,\dots,0)}^{(N+N_1,\dots,N+N_K)} w_i \delta_{x_i}, \]

where, for a multi-index $i \in \{0, \dots, N + N_1\} \times \dots \times \{0, \dots, N + N_K\}$,

\[ w_i := w_{1 i_1} w_{2 i_2} \cdots w_{K i_K} \geq 0, \qquad x_i := (x_{1 i_1}, \dots, x_{K i_K}) \in \mathcal{X}. \]

Note that this means that the support of $\mu$ is a rectangular grid in $\mathcal{X}$.

(b) As noted in Remark 14.17(b), the support of a discrete measure $\mu \in \mathcal{A}_\Delta$,
while finite, can be very large when $K$ is large: the upper bound is

\[ \#\operatorname{supp}(\mu) \leq \prod_{k=1}^{K} (1 + N + N_k). \]

In such cases, it is usually necessary to sacrifice exact integration against
$\mu$ for the sake of computational cost and resort to Monte Carlo averages
against $\mu$.

(c) However, it is often found in practice that the $\mu^* \in \mathcal{A}_\Delta$ that extremizes
$Q(\mu^*)$ does not have support on as many distinct points of $\mathcal{X}$ as Theorem
14.19 permits as an upper bound, and that not all of the constraints
determining $\mathcal{A}$ hold as equalities. That is, there are often many inactive
and non-binding constraints, and only those that are active and binding
truly carry information about the extreme values of $Q$.

(d) Finally, note that this approach to UQ is non-intrusive in the sense that
if we have a deterministic solver for $g \colon \mathcal{X} \to \mathcal{Y}$ and are interested in
$\mathbb{E}_{X \sim \mu^\dagger}[q(g(X))]$ for some quantity of interest $q \colon \mathcal{Y} \to \mathbb{R}$, then the deter-
ministic solver can be used ‘as is’ at each support point $x$ of $\mu \in \mathcal{A}_\Delta$ in
the optimization with respect to $\mu$ over $\mathcal{A}_\Delta$.
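
The rectangular-grid structure of Remark 14.21(a) and the trade-off in Remark 14.21(b) can be made concrete with a short Python sketch. Everything named here (the helper functions, the choice of three two-point marginals, and the integrand `q`) is hypothetical and chosen only to keep the example small.

```python
import itertools
import math
import random

def product_measure(marginals):
    """Yield (w_i, x_i) over all multi-indices i for mu = mu_1 x ... x mu_K.

    Each marginal is a list of (weight, point) pairs.
    """
    for combo in itertools.product(*marginals):
        weight = math.prod(w for w, _ in combo)
        point = tuple(x for _, x in combo)
        yield weight, point

def exact_expectation(marginals, q):
    """Exact E_mu[q] as a finite sum over the whole rectangular grid."""
    return sum(w * q(x) for w, x in product_measure(marginals))

def monte_carlo_expectation(marginals, q, n_samples, seed=0):
    """Monte Carlo estimate of E_mu[q], sampling each coordinate independently."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = tuple(
            rng.choices([p for _, p in m], weights=[w for w, _ in m])[0]
            for m in marginals
        )
        total += q(x)
    return total / n_samples

# Hypothetical example: K = 3 marginals, each supported on 2 points.
marginals = [[(0.5, 0.0), (0.5, 1.0)] for _ in range(3)]
q = lambda x: sum(x)

support_size = math.prod(len(m) for m in marginals)  # size of the grid
exact = exact_expectation(marginals, q)
estimate = monte_carlo_expectation(marginals, q, n_samples=5000)
```

The exact sum visits every grid point, so its cost grows like the product in Remark 14.21(b); the Monte Carlo estimate has a cost set by the number of samples instead, at the price of sampling error.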

In addition to epistemic uncertainty about probability measures, applications
often feature epistemic uncertainty about the functions involved. For exam-
ple, if the system of interest is in reality some function $g^\dagger$ from a space $\mathcal{X}$
of inputs to another space $\mathcal{Y}$ of outputs, it may only be known that $g^\dagger$ lies
in some subset $\mathcal{G}$ of the set of all (measurable) functions from $\mathcal{X}$ to $\mathcal{Y}$; fur-
thermore, our information about $g^\dagger$ and our information about $\mu^\dagger$ may be
coupled in some way, e.g. by knowledge of $\mathbb{E}_{X \sim \mu^\dagger}[g^\dagger(X)]$. Therefore, we now
consider admissible sets of the form


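\[ \mathcal{A} \subseteq \left\{ (g, \mu) \,\middle|\, g \colon \mathcal{X} \to \mathcal{Y} \text{ is measurable and } \mu \in \mathcal{M}_1(\mathcal{X}) \right\}, \]

quantities of interest of the form $Q(g, \mu) = \mathbb{E}_{X \sim \mu}[q(X, g(X))]$ for some
measurable function $q \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, and seek the extreme values

\[ \underline{Q}(\mathcal{A}) := \inf_{(g, \mu) \in \mathcal{A}} \mathbb{E}_{X \sim \mu}[q(X, g(X))] \quad \text{and} \quad \overline{Q}(\mathcal{A}) := \sup_{(g, \mu) \in \mathcal{A}} \mathbb{E}_{X \sim \mu}[q(X, g(X))]. \]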


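Obviously, if for each $g \colon \mathcal{X} \to \mathcal{Y}$ the set of $\mu \in \mathcal{M}_1(\mathcal{X})$ such that $(g, \mu) \in \mathcal{A}$
is a moment class of the form considered in Theorem 14.19, then

\[ \operatorname*{ext}_{(g, \mu) \in \mathcal{A}} \mathbb{E}_{X \sim \mu}[q(X, g(X))] = \operatorname*{ext}_{\substack{(g, \mu) \in \mathcal{A} \\ \mu \in \bigotimes_{k=1}^{K} \Delta_{N + N_k}(\mathcal{X}_k)}} \mathbb{E}_{X \sim \mu}[q(X, g(X))]. \]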

In principle, though, although the search over $\mu$ is finite-dimensional for each
$g$, the search over $g$ is still infinite-dimensional. However, the passage to
discrete measures often enables us to finite-dimensionalize the search over $g$,
since, in some sense, only the values of $g$ on the finite set $\operatorname{supp}(\mu)$ ‘matter’
in computing $\mathbb{E}_{X \sim \mu}[q(X, g(X))]$.

The idea is quite simple: instead of optimizing with respect to $g \in \mathcal{G}$, we
optimize with respect to the finite-dimensional vector $y = g|_{\operatorname{supp}(\mu)}$. However,
this reduction step requires some care:

(a) Some ‘functions’ do not have their values defined pointwise, e.g. ‘func-
tions’ in Lebesgue and Sobolev spaces, which are actually equivalence
classes of functions modulo equality almost everywhere. If isolated points
have measure zero, then it makes no sense to restrict such ‘functions’ to
a finite set $\operatorname{supp}(\mu)$. These difficulties are circumvented by insisting that
$\mathcal{G}$ be a space of functions with pointwise-defined values.

(b) In the other direction, it is sometimes difficult to verify whether a vector
$y$ indeed arises as the restriction to $\operatorname{supp}(\mu)$ of some $g \in \mathcal{G}$; we need
functions that can be extended from $\operatorname{supp}(\mu)$ to all of $\mathcal{X}$. Suitable
extension properties are ensured if we restrict attention to smooth enough
functions between the right kinds of spaces.

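Theorem 14.22 (Minty, 1970). Let $(\mathcal{X}, d)$ be a metric space, let $\mathcal{H}$ be a
Hilbert space, let $E \subseteq \mathcal{X}$, and suppose that $f \colon E \to \mathcal{H}$ satisfies

\[ \| f(x) - f(y) \|_{\mathcal{H}} \leq d(x, y)^\alpha \quad \text{for all } x, y \in E \tag{14.6} \]

with Hölder exponent $0 < \alpha \leq 1$. Then there exists $F \colon \mathcal{X} \to \mathcal{H}$ such that
$F|_E = f$ and (14.6) holds for all $x, y \in \mathcal{X}$ if either $\alpha \leq \frac{1}{2}$ or if $\mathcal{X}$ is an inner
product space with metric given by $d(x, y) = k^{1/\alpha} \| x - y \|$ for some $k > 0$.
Furthermore, the extension can be performed so that $F(\mathcal{X}) \subseteq \operatorname{co}(f(E))$, and
hence without increasing the Hölder norm

\[ \| f \|_{C^{0,\alpha}} := \sup_{x} \| f(x) \|_{\mathcal{H}} + \sup_{x \neq y} \frac{\| f(x) - f(y) \|_{\mathcal{H}}}{d(x, y)^\alpha}, \]

where the suprema are taken over $E$ or $\mathcal{X}$ as appropriate.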

Minty’s extension theorem includes as special cases the Kirszbraun- 
Valentine theorem (which assures that Lipschitz functions between Hilbert 
spaces can be extended without increasing the Lipschitz constant) and 
McShane’s theorem (which assures that scalar-valued continuous functions 
on metric spaces can be extended without increasing a prescribed convex 
modulus of continuity). However, the extensibility property fails for Lipschitz 
functions between Banach spaces, even finite-dimensional ones, as shown by 
the following example of Federer (1969, p. 202): 

Example 14.23. Let $E \subseteq \mathbb{R}^2$ be given by $E := \{ (1, -1), (-1, 1), (1, 1) \}$ and
define $f \colon E \to \mathbb{R}^2$ by

\[ f((1, -1)) := (1, 0), \quad f((-1, 1)) := (-1, 0), \quad \text{and} \quad f((1, 1)) := (0, \sqrt{3}). \]

Suppose that we wish to extend this $f$ to $F \colon \mathbb{R}^2 \to \mathbb{R}^2$, where $E$ and the
domain copy of $\mathbb{R}^2$ are given the metric arising from the maximum norm
$\| \cdot \|_\infty$ and the range copy of $\mathbb{R}^2$ is given the metric arising from the Euclidean
norm $\| \cdot \|_2$. Then, for all distinct $x, y \in E$,

\[ \| x - y \|_\infty = 2 = \| f(x) - f(y) \|_2, \]

so $f$ has Lipschitz constant 1 on $E$. What value should $F$ take at the
origin, $(0, 0)$, if it is to have Lipschitz constant at most 1? Since $(0, 0)$ lies
at $\| \cdot \|_\infty$-distance 1 from all three points of $E$, $F((0, 0))$ must lie within
$\| \cdot \|_2$-distance 1 of all three points of $f(E)$. However, there is no such point
of $\mathbb{R}^2$ within distance 1 of all three points of $f(E)$, and hence any extension
of $f$ to $F \colon \mathbb{R}^2 \to \mathbb{R}^2$ must have $\operatorname{Lip}(F) > 1$; indeed, any such $F$ must have
$\operatorname{Lip}(F) \geq \frac{2}{\sqrt{3}}$. See Figure 14.4.

Fig. 14.4: Illustration of Example 14.23. The function $f$ that takes the three
points on the left (equipped with $\| \cdot \|_\infty$) to the three points on the right
(equipped with $\| \cdot \|_2$) has Lipschitz constant 1, but has no 1-Lipschitz exten-
sion $F$ to $(0, 0)$, let alone the whole plane $\mathbb{R}^2$, since $F((0, 0))$ would have to
lie in the (empty) intersection of the three grey discs.
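
The arithmetic in Example 14.23 is easy to confirm numerically. The following sketch checks that $f$ is 1-Lipschitz on $E$ and then examines how small the Lipschitz ratio at the origin can be made; the helper names and the search grid are hypothetical choices made for illustration, and the grid search is only a sanity check, not a proof.

```python
import math

E = [(1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]
f = {E[0]: (1.0, 0.0), E[1]: (-1.0, 0.0), E[2]: (0.0, math.sqrt(3.0))}

def dist_inf(a, b):
    """Distance in the maximum norm (domain)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def dist_2(a, b):
    """Distance in the Euclidean norm (range)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Lipschitz constant of f on the three-point set E.
lip_on_E = max(dist_2(f[a], f[b]) / dist_inf(a, b) for a in E for b in E if a != b)

def required_lipschitz(z):
    """Smallest Lipschitz constant compatible with setting F((0, 0)) = z."""
    origin = (0.0, 0.0)
    return max(dist_2(z, f[a]) / dist_inf(origin, a) for a in E)

# The centroid of the equilateral triangle f(E) is equidistant from its vertices.
centroid = (0.0, math.sqrt(3.0) / 3.0)
at_centroid = required_lipschitz(centroid)

# Crude grid search over candidate values z for F((0, 0)).
grid_best = min(
    required_lipschitz((-1.5 + 0.05 * i, -1.0 + 0.05 * j))
    for i in range(61)
    for j in range(61)
)
```

Placing $F((0, 0))$ at the centroid gives the ratio $2/\sqrt{3}$, and no grid point does better, which is consistent with the lower bound stated in the example.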

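Theorem 14.24. Let $\mathcal{G}$ be a collection of measurable functions from $\mathcal{X}$ to
$\mathcal{Y}$ such that, for every finite subset $E \subseteq \mathcal{X}$ and $g \colon E \to \mathcal{Y}$, it is possible to
determine whether or not $g$ can be extended to an element of $\mathcal{G}$. Let $\mathcal{A} \subseteq
\mathcal{G} \times \mathcal{M}_1(\mathcal{X})$ be such that, for each $g \in \mathcal{G}$, the set of $\mu \in \mathcal{M}_1(\mathcal{X})$ such that
$(g, \mu) \in \mathcal{A}$ is a moment class of the form considered in Theorem 14.19. Let

\[ \mathcal{A}_\Delta := \left\{ (y, \mu) \,\middle|\, \begin{array}{l} \mu \in \bigotimes_{k=1}^{K} \Delta_{N + N_k}(\mathcal{X}_k), \\ y \text{ is the restriction to } \operatorname{supp}(\mu) \text{ of some } g \in \mathcal{G}, \\ \text{and } (g, \mu) \in \mathcal{A} \end{array} \right\}. \]

Then, if $q$ is bounded either above or below, $\underline{Q}(\mathcal{A}) = \underline{Q}(\mathcal{A}_\Delta)$ and
$\overline{Q}(\mathcal{A}) = \overline{Q}(\mathcal{A}_\Delta)$.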

Proof. Exercise 14.8. □ 

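Example 14.25. Suppose that $g^\dagger \colon [-1, 1] \to \mathbb{R}$ is known to have Lipschitz
constant $\operatorname{Lip}(g^\dagger) \leq L$. Suppose also that the inputs of $g^\dagger$ are distributed
according to $\mu^\dagger \in \mathcal{M}_1([-1, 1])$, and it is known that

\[ \mathbb{E}_{X \sim \mu^\dagger}[X] = 0 \quad \text{and} \quad \mathbb{E}_{X \sim \mu^\dagger}[g^\dagger(X)] \geq m > 0. \]

Hence, the corresponding feasible set is

\[ \mathcal{A} = \left\{ (g, \mu) \,\middle|\, \begin{array}{l} g \colon [-1, 1] \to \mathbb{R} \text{ has Lipschitz constant} \leq L, \\ \mu \in \mathcal{M}_1([-1, 1]), \ \mathbb{E}_{X \sim \mu}[X] = 0, \text{ and } \mathbb{E}_{X \sim \mu}[g(X)] \geq m \end{array} \right\}. \]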
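
Suppose that our quantity of interest is the probability of output values
below 0, i.e. $q(x, y) = \mathbb{I}[y \leq 0]$. Then Theorem 14.24 ensures that the extreme
values of

\[ Q(g, \mu) = \mathbb{E}_{X \sim \mu}[\mathbb{I}[g(X) \leq 0]] = \mathbb{P}_{X \sim \mu}[g(X) \leq 0] \]

are the solutions of

\begin{align*}
\text{extremize:} \quad & \sum_{i=0}^{2} w_i \mathbb{I}[y_i \leq 0] \\
\text{with respect to:} \quad & w_0, w_1, w_2 \geq 0 \\
& x_0, x_1, x_2 \in [-1, 1] \\
& y_0, y_1, y_2 \in \mathbb{R} \\
\text{subject to:} \quad & \sum_{i=0}^{2} w_i = 1 \\
& \sum_{i=0}^{2} w_i x_i = 0 \\
& \sum_{i=0}^{2} w_i y_i \geq m \\
& |y_i - y_j| \leq L |x_i - x_j| \text{ for } i, j \in \{0, 1, 2\}.
\end{align*}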
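
The finite-dimensional problem of Example 14.25 is simple enough that its constraints and objective can be written down directly in code. The sketch below only evaluates candidates; it does not solve the optimization problem. The function names, the tolerance, and the numerical values $L = 1$, $m = 0.5$ and the two candidate points are hypothetical.

```python
def is_feasible(w, x, y, L, m, tol=1e-12):
    """Check the constraints of the reduced problem for weights w, points x, values y."""
    n = len(w)
    if any(wi < -tol for wi in w) or abs(sum(w) - 1.0) > tol:
        return False  # w must be a probability vector
    if any(abs(xi) > 1.0 + tol for xi in x):
        return False  # support points must lie in [-1, 1]
    if abs(sum(wi * xi for wi, xi in zip(w, x))) > tol:
        return False  # E[X] = 0
    if sum(wi * yi for wi, yi in zip(w, y)) < m - tol:
        return False  # E[g(X)] >= m
    # y must be the restriction of some L-Lipschitz function to the support.
    return all(
        abs(y[i] - y[j]) <= L * abs(x[i] - x[j]) + tol
        for i in range(n)
        for j in range(n)
    )

def objective(w, y):
    """P[g(X) <= 0] under the discrete measure: total weight on values y_i <= 0."""
    return sum(wi for wi, yi in zip(w, y) if yi <= 0.0)

L, m = 1.0, 0.5
w = (0.5, 0.5, 0.0)
x = (-1.0, 1.0, 0.0)
y_good = (0.0, 1.0, 0.5)  # compatible with a 1-Lipschitz function
y_bad = (0.0, 3.0, 0.5)   # |y_0 - y_1| = 3 exceeds L * |x_0 - x_1| = 2

feasible_good = is_feasible(w, x, y_good, L, m)
feasible_bad = is_feasible(w, x, y_bad, L, m)
value_good = objective(w, y_good)
```

A numerical global optimizer would search over `(w, x, y)` with `is_feasible` as the constraint and `objective` as the target; the feasible candidate shown here gives a lower bound on the supremum, not its value.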

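Example 14.26 (McDiarmid). The following admissible set corresponds to
the assumptions of McDiarmid’s inequality (Theorem 10.12):

\[ \mathcal{A}_{\mathrm{McD}} := \left\{ (g, \mu) \,\middle|\, \begin{array}{l} g \colon \mathcal{X} \to \mathbb{R} \text{ has } \mathcal{D}_k[g] \leq D_k, \\ \mu = \bigotimes_{k=1}^{K} \mu_k \in \mathcal{M}_1(\mathcal{X}), \\ \text{and } \mathbb{E}_{X \sim \mu}[g(X)] = m \end{array} \right\}. \]

Let $m_+ := \max\{0, m\}$. McDiarmid’s inequality is the upper bound

\[ \overline{Q}(\mathcal{A}_{\mathrm{McD}}) := \sup_{(g, \mu) \in \mathcal{A}_{\mathrm{McD}}} \mathbb{P}_\mu[g(X) \leq 0] \leq \exp\left( - \frac{2 m_+^2}{\sum_{k=1}^{K} D_k^2} \right). \]

Perhaps not surprisingly given its general form, McDiarmid’s inequality is
not the least upper bound on $\mathbb{P}_\mu[g(X) \leq 0]$; the actual least upper bound
can be calculated using the reduction theorems. The proofs are lengthy, and
the results are dependent upon $K$.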

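(a) For $K = 1$,

\[ \overline{Q}(\mathcal{A}_{\mathrm{McD}}) = \begin{cases} 0, & \text{if } D_1 \leq m_+, \\ 1 - \dfrac{m_+}{D_1}, & \text{if } 0 \leq m_+ \leq D_1. \end{cases} \tag{14.7} \]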

(b) For $K = 2$,

\[ \overline{Q}(\mathcal{A}_{\mathrm{McD}}) = \begin{cases} 0, & \text{if } D_1 + D_2 \leq m_+, \\ \dfrac{(D_1 + D_2 - m_+)^2}{4 D_1 D_2}, & \text{if } |D_1 - D_2| \leq m_+ \leq D_1 + D_2, \\ 1 - \dfrac{m_+}{\max\{D_1, D_2\}}, & \text{if } 0 \leq m_+ \leq |D_1 - D_2|. \end{cases} \tag{14.8} \]

Note that in the third case, $\min\{D_1, D_2\}$ does not contribute to the least
upper bound on $\mathbb{P}_\mu[g(X) \leq 0]$. In other words, if most of the uncertainty
is contained in the first variable (i.e. $m_+ + D_2 \leq D_1$), then the uncertainty
associated with the second variable does not affect the global uncertainty;
the inequality $\mathcal{D}_2[g] \leq D_2$ is non-binding information, and a reduction of
the global uncertainty requires a reduction in $D_1$.

(c) Similar, but more complicated, results are possible for $K \geq 3$, and
there are similar ‘screening effects’ in which only a few of the diame-
ter constraints supply binding information to the optimization problem
for $\overline{Q}(\mathcal{A}_{\mathrm{McD}})$.
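
The closed-form expressions (14.7) and (14.8) are straightforward to evaluate and compare with McDiarmid's bound. The sketch below assumes strictly positive diameters; the function names and the numerical values are hypothetical and serve only to illustrate the comparison and the screening effect in the third case of (14.8).

```python
import math

def mcdiarmid_bound(m, diameters):
    """McDiarmid's upper bound exp(-2 m_+^2 / sum_k D_k^2)."""
    m_plus = max(0.0, m)
    return math.exp(-2.0 * m_plus ** 2 / sum(d ** 2 for d in diameters))

def optimal_bound_k1(m, d1):
    """Least upper bound (14.7) for K = 1, assuming d1 > 0."""
    m_plus = max(0.0, m)
    return 0.0 if d1 <= m_plus else 1.0 - m_plus / d1

def optimal_bound_k2(m, d1, d2):
    """Least upper bound (14.8) for K = 2, assuming d1, d2 > 0."""
    m_plus = max(0.0, m)
    if d1 + d2 <= m_plus:
        return 0.0
    if abs(d1 - d2) <= m_plus:
        return (d1 + d2 - m_plus) ** 2 / (4.0 * d1 * d2)
    return 1.0 - m_plus / max(d1, d2)

# Balanced diameters: the optimal bound is strictly sharper than McDiarmid's.
sharp = optimal_bound_k2(1.0, 1.0, 1.0)
loose = mcdiarmid_bound(1.0, [1.0, 1.0])

# Screening: with D_1 dominant, changing D_2 does not change the optimal bound.
screened_a = optimal_bound_k2(1.0, 4.0, 1.0)
screened_b = optimal_bound_k2(1.0, 4.0, 2.0)
```

In the screened regime the two values coincide, whereas McDiarmid's bound would change with $D_2$; this is the non-binding behaviour described in part (b).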


Dominant Uncertainties and Screening Effects. The phenomenon observed
in the $K = 2$ solution of the optimal McDiarmid inequality (14.8)
occurs in many contexts: not all of the constraints that specify $\mathcal{A}$ necessarily
hold as binding or active constraints at the extremizing solution $(g^*, \mu^*) \in \mathcal{A}$.
That is, the best- and worst-case predictions for the quantity of interest
$Q(g^\dagger, \mu^\dagger)$ are controlled by only a few pieces of input information, and the
others have not just little impact, but none at all! Far from being undesirable,
this phenomenon is actually very useful, since it can be used to direct future
information-gathering activities, such as expensive experimental campaigns,
by attempting to acquire information (and hence pass to a smaller feasible
set $\mathcal{A}' \subsetneq \mathcal{A}$) that will modify the binding/active constraints for the previ-
ous problem, i.e. invalidate the previous extremizer in $\mathcal{A}$ and lead to a new
extremizer in $\mathcal{A}'$. In this way, we pass from the optimal bounds given
the information in $\mathcal{A}$

\[ \underline{Q}(\mathcal{A}) \leq Q(g^\dagger, \mu^\dagger) \leq \overline{Q}(\mathcal{A}) \]

to improved optimal bounds given the information in $\mathcal{A}'$

\[ \underline{Q}(\mathcal{A}) \leq \underline{Q}(\mathcal{A}') \leq Q(g^\dagger, \mu^\dagger) \leq \overline{Q}(\mathcal{A}') \leq \overline{Q}(\mathcal{A}). \]

estimation, in the case in which it is a convex problem. Convex optimization
Marshall et al. (2011).
that the simplex is compact; the assumption was later dropped by Kendall
(1962). For linear programming in infinite-dimensional spaces, with careful 
attention to what parts of the analysis are purely algebraic and what parts 
require topology/order theory, see Anderson and Nash (1987).

The classification of the extreme points of moment sets, and the conse-
quences for the optimization of measure affine functionals, are due to von
Weizsäcker and Winkler (1979/80, 1980) and Winkler (1988). Theorem 14.19
and the Lipschitz version of Theorem 14.24 can be found in Owhadi et al. 
(2013) and Sullivan et al. (2013) respectively. Theorem 14.22 is due to Minty 
(1970), and generalizes earlier results by McShane (1934), Kirszbraun (1934), 
and Valentine (1945). The optimal version of McDiarmid’s inequality is given 
by Owhadi et al. (2013, Section 5.1.1). 


14.6 Exercises 


Exercise 14.1. Let $\mathcal{P}_k$ denote the set of probability measures $\mu$ on $\mathbb{R}$ with
finite moments up to order $k \geq 0$, i.e.

\[ \mathcal{P}_k := \left\{ \mu \in \mathcal{M}_1(\mathbb{R}) \,\middle|\, \int_{\mathbb{R}} |x|^k \, \mathrm{d}\mu(x) < \infty \right\}. \]

Show that $\mathcal{P}_k$ is a ‘small’ subset of $\mathcal{P}_\ell$ whenever $k > \ell$ in the sense that, for
every $\mu \in \mathcal{P}_k$ and every $\varepsilon > 0$, there exists $\nu \in \mathcal{P}_\ell \setminus \mathcal{P}_k$ with $d_{\mathrm{TV}}(\mu, \nu) < \varepsilon$.
Hint: follow the example of the Cauchy-Lorentz distribution considered in
Exercise 8.3 to construct a ‘standard’ probability measure with polynomial
moments of order $\ell$ and no higher, and consider convex combinations of this
‘standard’ measure with $\mu$.

Exercise 14.2. Suppose that a six-sided die (with the six sides bearing 1 to
6 spots) has been tossed $N \gg 1$ times and that the sample average number
of spots is 4.5, rather than 3.5 as one would usually expect. Assume that this
sample average is, in fact, the true average.

(a) What, according to the Principle of Maximum Entropy, is the correct 
probability distribution on the six sides of the die given this information? 

(b) What are the optimal lower and upper probabilities of each of the 6 sides 
of the die given this information? 

Exercise 14.3. Consider the topology $\mathcal{T}$ on $\mathbb{R}$ generated by the basis of
open sets $[a, b)$, where $a, b \in \mathbb{R}$.

1. Show that this topology generates the same $\sigma$-algebra on $\mathbb{R}$ as the usual
Euclidean topology does. Hence, show that Gaussian measure is a well-
defined probability measure on the Borel $\sigma$-algebra of $(\mathbb{R}, \mathcal{T})$.

2. Show that every compact subset of $(\mathbb{R}, \mathcal{T})$ is a countable set.

3. Conclude that Gaussian measure on $(\mathbb{R}, \mathcal{T})$ is not inner regular and that
$(\mathbb{R}, \mathcal{T})$ is not a pseudo-Radon space.

Exercise 14.4. Suppose that $\mathcal{A}$ is a moment class of probability measures
on $\mathcal{X}$ and that $q \colon \mathcal{X} \to \mathbb{R} \cup \{\pm\infty\}$ is bounded either below or above. Show
that $\mu \mapsto \mathbb{E}_\mu[q]$ is a measure affine map. Hint: verify the assertion for the
case in which $q$ is the indicator function of a measurable set; extend it to
bounded measurable functions using the monotone class theorem; for non-
negative $\mu$-integrable functions $q$, use monotone convergence to verify the
barycentric formula.

Exercise 14.5. Let $\lambda$ denote uniform measure on the unit interval $[0, 1] \subseteq \mathbb{R}$.
Show that the line segment in $\mathcal{M}_1([0, 1]^2)$ joining the measures $\lambda \otimes \delta_0$ and
$\delta_0 \otimes \lambda$ contains measures that are not product measures. Hence show that a
set $\mathcal{A}$ of product probability measures like that considered in Theorem 14.19
is typically not convex.

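Exercise 14.6. Calculate by hand, as a function of $t \in \mathbb{R}$, $D > 0$ and $m \in \mathbb{R}$,

\[ \sup_{\mu \in \mathcal{A}} \mathbb{P}_{X \sim \mu}[X \leq t], \]

where

\[ \mathcal{A} := \left\{ \mu \in \mathcal{M}_1(\mathbb{R}) \,\middle|\, \begin{array}{l} \mathbb{E}_{X \sim \mu}[X] \geq m, \text{ and} \\ \operatorname{diam}(\operatorname{supp}(\mu)) \leq D \end{array} \right\}. \]

Exercise 14.7. Calculate by hand, as a function of $t \in \mathbb{R}$, $s > 0$ and $m \in \mathbb{R}$,

\[ \sup_{\mu \in \mathcal{A}} \mathbb{P}_{X \sim \mu}[X - m \geq st], \]

and

\[ \sup_{\mu \in \mathcal{A}} \mathbb{P}_{X \sim \mu}[|X - m| \geq st], \]

where

\[ \mathcal{A} := \left\{ \mu \in \mathcal{M}_1(\mathbb{R}) \,\middle|\, \begin{array}{l} \mathbb{E}_{X \sim \mu}[X] \leq m, \text{ and} \\ \mathbb{E}_{X \sim \mu}[|X - m|^2] \leq s^2 \end{array} \right\}. \]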

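Exercise 14.8. Prove Theorem 14.24.

Exercise 14.9. Calculate by hand, as a function of $t \in \mathbb{R}$, $m \in \mathbb{R}$, $z \in [0, 1]$
and $v \in \mathbb{R}$,

\[ \sup_{(g, \mu) \in \mathcal{A}} \mathbb{P}_{X \sim \mu}[g(X) \leq t], \]

where

\[ \mathcal{A} := \left\{ (g, \mu) \,\middle|\, \begin{array}{l} g \colon [0, 1] \to \mathbb{R} \text{ has Lipschitz constant } 1, \\ \mu \in \mathcal{M}_1([0, 1]), \ \mathbb{E}_{X \sim \mu}[g(X)] \geq m, \\ \text{and } g(z) = v \end{array} \right\}. \]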

Numerically verify your calculations. 